Overview
Getting started on a machine learning project is always a challenge. There are lots of questions to answer, and frequently you don’t even know which questions to ask. In this post, and the four others linked in their respective sections, I hope to explain the fundamentals of building a machine learning project from the ground up, the kinds of choices you might have to make along the way, and how best to make them.
For the purposes of this blog post, we’re going to be focusing mostly on supervised learning models – ones that learn with labeled data – as opposed to unsupervised. While most of what we’ll talk about can be applied to both sorts of problems, unsupervised learning has some quirks that are worth talking about separately, which won’t be addressed in this overview.
Preprocessing
The very first part of any machine learning project is cleaning and reformatting the data, also known as pre-processing. Raw data is almost never ready to go straight from the source into your training program.
The first reason for this is that raw data is frequently very messy. Values may be missing or mislabeled, and they might need to be replaced. Our first decision has now cropped up: what value do we want to substitute in for the ones that are missing? One option is to use the average, if the values are numerical; another is to use the most common value, if the values are string-based. Regardless of what you choose, the shape of the data will be affected, and it’s usually worth trying multiple approaches. Other forms of data cleaning include removing extraneous features, or removing outlier records, in order to decrease the amount of noise in the dataset.
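As a rough illustration, here is a minimal sketch of both substitutions using pandas; the DataFrame, its column names, and its values are made up purely for the example.

```python
# A minimal sketch of filling in missing values with pandas.
# The "age" and "color" columns here are hypothetical examples.
import pandas as pd

df = pd.DataFrame({
    "age":   [34, None, 29, 41, None],
    "color": ["red", "blue", None, "blue", "red"],
})

# Numerical column: substitute the average of the observed values.
df["age"] = df["age"].fillna(df["age"].mean())

# String-based column: substitute the most common (mode) value.
df["color"] = df["color"].fillna(df["color"].mode()[0])

print(df)
```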
A second reason for pre-processing is that frequently data will need to be reformatted before it is usable by a machine learning library. This often leads to the processes of categorical encoding – converting string-based values into numerical ones – and numerical scaling. There are many different approaches to both of these processes as well, each with their own benefits and tradeoffs. The most important thing to remember in pre-processing is that the choices you make now will impact the effectiveness of your model later, so serious consideration should be given to these questions when beginning your project.
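Here is a minimal sketch of both steps, assuming a small made-up DataFrame, pandas’ get_dummies for the encoding, and scikit-learn’s StandardScaler for the scaling; plenty of other encoders and scalers would work just as well.

```python
# A minimal sketch of categorical encoding and numerical scaling.
# The "color" and "weight" columns are hypothetical examples.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green"],
    "weight": [12.0, 7.5, 13.2, 9.8],
})

# Categorical encoding: each string category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["color"])

# Numerical scaling: shift and rescale the numeric column to mean 0, variance 1.
encoded["weight"] = StandardScaler().fit_transform(encoded[["weight"]]).ravel()

print(encoded)
```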
For more information on pre-processing approaches, see our detailed blog post here.
Model Selection and Creation
With the data cleaned up and ready to go, it’s time to pick a method/algorithm to use to build your models. There is rarely a “right” answer when it comes to model selection; as with most things in machine learning, there are only good questions, reasonable approaches, and always some trade-offs.
Some of the big questions to consider when selecting a model are things like: Do I believe my data is linearly separable? What impact will the distribution of classes in my data set have on my model (i.e., is the data biased heavily towards one class)? Do I need my model to support multi-class classification, or just binary classification? How large is my dataset – in terms of both records and features – and how is that going to affect the runtime of my model? The answers you come up with to these questions might point you in the direction of any number of different models, but the key is not to think too narrowly when it comes to model selection. We’ll discuss how these questions might relate to specific machine learning models in a future post.
Before we continue though, it’s important to discuss a topic that will affect the flow of your project in a significant way: cross-validation. Cross-validation is the practice of building multiple models with different cross sections of your data set in order to determine the validity of your results. Using a subsection of your training data as your initial test data allows you to check the performance of your model before you start working with actual unlabeled data. The cross sections are generally created by choosing a number of desired folds, n, and using 1/n of the records as your testing data and the remaining (n-1)/n as your training data. Cross-validation can be further enhanced with stratified cross-validation, which preserves the proportion of each class in every fold instead of drawing a purely random sample.
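As an illustration, here is a minimal sketch of stratified cross-validation with scikit-learn; the synthetic dataset and the choice of logistic regression are stand-ins for your own data and model.

```python
# A minimal sketch of stratified k-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# A synthetic dataset standing in for your own cleaned, pre-processed data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# With n = 5 folds, each fold holds out 1/5 of the records for testing and
# trains on the remaining 4/5, preserving the class proportions in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores, scores.mean())
```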
For more information on model selection and cross-validation, see our detailed blog post here.
Model Testing and Metrics
Once you have your models built, how do you know whether or not they’ve learned anything? This question is harder to answer than it looks. One might want to go on the model’s ability to “accurately” predict testing records. However, that allows for the following failure case: in a binary classification problem, your model predicts that every record is of class A, 100% of the time. This is not a good classifier. Yet if – for whatever reason – your testing data is 90% class A, the model will show 90% accuracy when you give it that data. Your model completely failed to “learn” the other class, but the accuracy is still very high. These sorts of thought experiments show why it’s important to look at a more complex system of metrics to determine a model’s quality.
This can most easily be done by calculating the following counts: True Positives, False Positives, True Negatives, and False Negatives. These values keep track of how your model’s predictions align with the actual labels of your testing data. So, if a record is of class B, and your model predicts that it is of class A, you add one to the count of False Positives for class A (your model incorrectly classified the record as A), and one to the count of False Negatives for class B (your model failed to correctly classify the record as B).
These numbers can then be used to calculate accuracy ((True Positives + True Negatives) / total number of records), your model’s effectiveness at capturing a target class, aka “recall” (True Positives / (True Positives + False Negatives)), and how accurate your model is when it predicts a target class, aka “precision” (True Positives / (True Positives + False Positives)). These are only a few of the ways to evaluate model performance, but they are frequently the most useful.
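As a quick illustration, here is a minimal sketch that derives all three metrics from those four counts in a binary problem; the label arrays are made up for the example.

```python
# A minimal sketch of accuracy, recall, and precision from the four counts.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # actual labels of the testing data
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # the model's predictions

# For a binary problem, the confusion matrix yields TN, FP, FN, TP directly.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)   # how much of the target class was captured
precision = tp / (tp + fp)   # how often a positive prediction was correct

print(accuracy, recall, precision)
```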
For more information on model testing and evaluation metrics, see our detailed blog post here.
Model Improvement Methods
So now we have clean data, multiple models, and ways to peer at what’s going on inside those models (or at least to judge the quality of what’s coming out), but there’s still one critical step to go: tuning your models.
The exact process will depend on the particular algorithm you’re using, but a good place to start is to choose a handful of broad ranges for your model’s parameters and narrow from there based on the results of your tests. Don’t take this advice to mean that you can expect linear relationships between your parameters and your results; in fact, you should not expect that at all. The idea is simply to start your testing broadly and then narrow your scope as you continue, and you shouldn’t be afraid to retread old possibilities if something doesn’t work out.
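One common way to run that broad-then-narrow search is a grid search over parameter values. The sketch below assumes scikit-learn’s GridSearchCV, an SVM classifier, and made-up parameter ranges purely for illustration; the “fine” grid would normally be chosen based on whatever the coarse search found.

```python
# A minimal sketch of coarse-to-fine parameter tuning via grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A synthetic dataset standing in for your own data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Start with broad, coarse ranges for the parameters...
coarse = {"C": [0.01, 1, 100], "gamma": [0.001, 0.1, 10]}
search = GridSearchCV(SVC(), coarse, cv=5).fit(X, y)
print("coarse best:", search.best_params_)

# ...then narrow the grid around the coarse winner and search again
# (hard-coded here for brevity).
fine = {"C": [0.5, 1, 2, 5], "gamma": [0.05, 0.1, 0.2]}
search = GridSearchCV(SVC(), fine, cv=5).fit(X, y)
print("fine best:", search.best_params_)
```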
Aside from tuning the parameters of your specific model, there are a handful of general approaches that you can use as well. One of these is boosting, which involves increasing the volume or the weight of your target class or classes. It is important to perform this after splitting your data for cross-validation, or else you will contaminate your model creation with bias. You should not boost your testing data.
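As a rough sketch of that ordering, here is one way to boost a minority class by oversampling it; the imbalanced dataset is made up for illustration, and the key point is that the split happens first, so the testing data is never boosted.

```python
# A minimal sketch of boosting a minority class in the training data only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# A made-up, heavily imbalanced dataset: 180 records of class 0, 20 of class 1.
X = np.random.rand(200, 4)
y = np.array([0] * 180 + [1] * 20)

# Split first, so the boosted records never leak into the testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the minority class within the training split only.
minority = y_train == 1
X_min, y_min = resample(X_train[minority], y_train[minority],
                        replace=True, n_samples=int((~minority).sum()),
                        random_state=0)
X_train = np.vstack([X_train[~minority], X_min])
y_train = np.concatenate([y_train[~minority], y_min])
```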
Other processes mentioned before, such as outlier removal, feature selection, and changing the amount of cross-validation can also improve the quality of your models. For more information on model improvement methods, see our detailed blog post here.
Conclusion
Hopefully this general outline – and the posts to come – have given you a good starting framework for tackling your machine learning problems. There is much more out there in the world of ML, to be sure, but with solid fundamentals you can start unlocking the secrets hidden in your data and use them to empower your decisions and applications.