ML-Mastery basic tips
Prepare Data For Machine Learning
- Step 1: Data Selection Consider what data is available, what data is missing and what data can be removed.
- Step 2: Data Preprocessing Organize your selected data by formatting, cleaning and sampling from it.
- Step 3: Data Transformation Transform the preprocessed data so it is ready for machine learning by engineering features with scaling, attribute decomposition and attribute aggregation (a small sketch follows this list).
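A minimal sketch of the transformation step, assuming pandas and scikit-learn are available; the DataFrame and its `timestamp`, `height_cm` and `weight_kg` columns, and the derived features, are illustrative assumptions, not from the original text.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a complex attribute (timestamp) and raw measures.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-03 14:00", "2021-06-20 09:30"]),
    "height_cm": [170.0, 182.0],
    "weight_kg": [65.0, 90.0],
})

# Attribute decomposition: split a complex attribute into simpler parts.
df["hour"] = df["timestamp"].dt.hour
df["month"] = df["timestamp"].dt.month

# Attribute aggregation: combine attributes into one more useful feature.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Scaling: standardize numeric features to zero mean and unit variance.
numeric_cols = ["height_cm", "weight_kg", "bmi", "hour", "month"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
print(df.head())
```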
Evaluate Machine Learning Algorithms
A process for rapidly testing algorithms to discover whether your problem contains structure that the algorithms can learn, and which algorithms are effective.
Spot-Checking Algorithms on your Machine Learning Problems
Spot-checking algorithms is about getting a quick assessment of a bunch of different algorithms on your machine learning problem so that you know what algorithms to focus on and what to discard.
There are 3 key benefits of spot-checking algorithms on your machine learning problems:
- Speed: You could spend a lot of time playing around with different algorithms, tuning parameters and thinking about what algorithms will do well on your problem. I have been there, and I ended up testing the same algorithms over and over because I was not being systematic. A single spot-check experiment can save hours, days and even weeks of noodling around.
- Objectivity: There is a tendency to go with what has worked for you before. We pick our favorite algorithm (or algorithms) and apply them to every problem we see. The power of machine learning is that there are so many different ways to approach a given problem. A spot-check experiment allows you to automatically and objectively discover the algorithms that are best at picking out the structure in the problem, so you can focus your attention on them.
- Results: Spot-checking algorithms gets you usable results, fast. You may discover a good enough solution in the first spot-check experiment. Alternatively, you may quickly learn that your dataset does not expose enough structure for any mainstream algorithm to do well. Spot-checking gives you the results you need to decide whether to move forward and optimize a given model or backward and revisit the presentation of the problem.
Below are 5 tips to ensure you are getting the most from spot-checking machine learning algorithms on your problem.
- Algorithm Diversity: You want a good mix of algorithm types. I like to include instance-based methods (like LVQ and kNN), functions and kernels (like neural nets, regression and SVM), rule systems (like Decision Table and RIPPER) and decision trees (like CART, ID3 and C4.5).
- Best Foot Forward: Each algorithm needs to be given a chance to put its best foot forward. This does not mean performing a sensitivity analysis on the parameters of each algorithm, but using experiments and heuristics to give each algorithm a fair chance. For example, if kNN is in the mix, give it 3 chances with k values of 1, 5 and 7.
- Formal Experiment: Don’t play. There is a huge temptation to try lots of different things in an informal manner, to play around with algorithms on your problem. The idea of spot-checking is to get to the methods that do well on the problem, fast. Design the experiment, run it, then analyze the results. Be methodical. I like to rank algorithms by their statistically significant wins (in pairwise comparisons) and take the top 3-5 as a basis for tuning (a minimal harness is sketched after this list).
- Jumping-off Point: The best performing algorithms are a starting point, not the solution to the problem. The algorithms that are shown to be effective may not be the best algorithms for the job. They are most likely to be useful pointers to types of algorithms that perform well on the problem. For example, if kNN does well, consider follow-up experiments on all the instance-based methods and variations of kNN you can think of.
- Build Your Short-list: As you learn and try many different algorithms you can add new algorithms to the suite of algorithms that you use in a spot-check experiment. When I discover a particularly powerful configuration of an algorithm, I like to generalize it and include it in my suite, making my suite more robust for the next problem.
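A minimal spot-check harness, assuming scikit-learn and a synthetic classification dataset; the suite below is an illustrative mix of instance-based, function/kernel and tree methods, with kNN given three chances (k = 1, 5, 7) as suggested above, not a definitive selection.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; replace with your own problem data.
X, y = make_classification(n_samples=500, n_features=20, random_state=7)

# A diverse suite of algorithm types, each given a fair chance.
models = {
    "knn-1": KNeighborsClassifier(n_neighbors=1),
    "knn-5": KNeighborsClassifier(n_neighbors=5),
    "knn-7": KNeighborsClassifier(n_neighbors=7),
    "logreg": LogisticRegression(max_iter=1000),
    "svm-rbf": SVC(),
    "cart": DecisionTreeClassifier(random_state=7),
}

# Score every model on the same 10-fold split and rank by mean accuracy.
results = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy")
    for name, model in models.items()
}
for name, scores in sorted(results.items(), key=lambda kv: -kv[1].mean()):
    print(f"{name:8s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

The ranked output gives you the top 3-5 candidates to carry into tuning; a full formal experiment would add pairwise significance tests on the fold scores.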
Improve Machine Learning Results
When tuning algorithms you must have high confidence in the results given by your test harness. This means you should be using techniques that reduce the variance of the performance measure you use to assess algorithm runs. I suggest cross-validation with a reasonably high number of folds.
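One such technique, sketched below assuming scikit-learn, is repeated stratified k-fold cross-validation, which averages the score over several reshuffled 10-fold runs for a more stable estimate than a single split; the dataset and model are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Stand-in data and model; swap in your own.
X, y = make_classification(n_samples=500, n_features=20, random_state=7)

# 10 folds, repeated 3 times with different shuffles = 30 score samples.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=7)
scores = cross_val_score(RandomForestClassifier(random_state=7), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```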
- Algorithm Tuning where discovering the best models is treated as a search problem through model parameter space (a grid-search sketch follows this list). The more tuned the parameters of an algorithm, the more biased the algorithm will be to the training data and test harness. This strategy can be effective, but it can also lead to more fragile models that overfit your test harness and don’t perform as well in practice.
- Ensembles where the predictions made by multiple models are combined (the three approaches below are sketched after this list).
- 1. Bagging: Known more formally as Bootstrapped Aggregation, this is where the same algorithm gets different perspectives on the problem by being trained on different subsets of the training data.
- 2. Boosting: Models are trained in sequence on the same training data, with each new model focusing on the examples that the previous models got wrong.
- 3. Blending: Known more formally as Stacked Generalization or Stacking, this is where the predictions of a variety of models are taken as input to a new model that learns how to combine them into an overall prediction.
- Extreme Feature Engineering where the attribute decomposition and aggregation seen in data preparation is pushed to its limits (a small sketch follows this list). Technically, what you are doing with this strategy is decomposing dependencies and non-linear relationships into simpler independent linear relationships.
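A minimal sketch of algorithm tuning as a search through parameter space, assuming scikit-learn; the SVM and its grid of C and gamma values are illustrative, not an exhaustive search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=7)

# Each grid point is one candidate model; cross-validation scores the search.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```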
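A minimal sketch of the three ensemble strategies, again assuming scikit-learn; the base learners and settings are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=7)

ensembles = {
    # Bagging: one algorithm, many bootstrap samples of the training data.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    # Boosting: models fit in sequence, each reweighting past mistakes.
    "boosting": AdaBoostClassifier(n_estimators=50),
    # Stacking: a meta-model learns to combine diverse base predictions.
    "stacking": StackingClassifier(
        estimators=[("knn", KNeighborsClassifier()),
                    ("cart", DecisionTreeClassifier())],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}
for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name:9s} mean accuracy = {scores.mean():.3f}")
```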
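And a minimal sketch of feature engineering turning a non-linear relationship into one a plain linear model can fit, assuming scikit-learn; the synthetic target is contrived for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(200, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1]  # non-linear in the raw attributes

# Degree-2 expansion adds squares and interaction terms, so the relationship
# becomes linear in the engineered feature space.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
print(f"R^2 with engineered features: {model.fit(X, y).score(X, y):.3f}")
```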