One of the things you learn quickly while working in machine learning is that no “one size fits all.” When it comes to choosing a model, this is truer than ever. However, there are some guidelines one can follow when trying to decide on an algorithm to use, and while we’re on the subject of model creation, it’s useful to discuss good practices and ways in which you can fall into unforeseen mistakes.
Before we dig into the topic of model selection, I want to take a moment to address an important idea you should be considering when designing the structure of your project: cross-validation. Cross-validation is essentially a model evaluation process that allows you to check the effectiveness of your model on more than one data set by building multiple training sets from your existing data. This is done by moving a testing data ‘fold’ through your data, building models with what isn’t set aside for training.
For example: let’s say you have 100 records, and want to use 10 ‘fold’ cross-validation, essentially building 10 distinct models. For the first model, you might use the first 10 records for your testing data, and then the next 90 records as training data. Once you’ve built a model with that training data and tested it on that testing data, you move the testing fold down to the 11th-20th records, and use the 1st-10th and 21st-100th records combined together as your training data. This process repeats, moving the testing fold through the dataset, until you have 10 distinct models and 10 distince results, which will give you a more robust picture of how well your process has learned than if you just built one model with all your data.
Cross-validation is mostly a straightforward process, but there are a couple of things to watch out for while you’re performing it. The first possible issue is the introduction of bias into your testing data. You have to be careful with the data transformations that you perform while using cross-validation; for example, if you’re boosting the presence of a target class by adding new rows of that class’ data, you have to make sure that the boosting occurs after you perform the testing-training split. Otherwise, the results of your testing will appear to be better than they would be otherwise, since your training data will be tainted with new, targeted data.
Another thing to consider is whether to use stratified cross-validation or not. Stratified cross-validation is an enhancement to the cross-validation process where instead of using the arbitrary order of your data set or a random sampling function to build your testing and training data, you sample from your whole data set equal to a similar proportional representation of each class as is in your whole data set. For example, if your data is 75% class A and 25% class B, stratified cross-validation would attempt to make testing and training samples that maintain that balance of classes. This has the benefit of more accurately depicting the nature of your original problem than using a random or arbitrary system.
Concerns in Selection
The major topic to think about when deciding what machine learning model to use is the shape and nature of your data. Some of the high level questions you might ask yourself: Is this a multi-class or binary class problem? (Note, if you only care about a single target class within a multi-class dataset, it’s possible to treat it as a binary class problem.) What is the distribution of classes? If the distribution is highly uneven, you may want to avoid certain types of models, such as decision tree based models, or consider boosting the presence of the under-represented class.
Another major question is whether your data is linearly separable or not. While it’s rare that the complicated datasets you will encounter are truly ‘linearly’ separable, datasets that contain more clearly defined class boundaries are good candidates for models such as support vector machines. That being said, it can be difficult to get a good picture of your dataset if it has a high number of features (also known as high dimensionality). In this case, there are still ways to map your dataset in a 2D plane, and it can be highly useful to do so as an initial step before model selection, in order to give you new insights into your data set. Rather than detail the approaches here, here is a link to a post by Andrew Norton which details how you can use the matplotlib library in python to visualize multi-dimensional data.
One of the final considerations that you have to make when selecting your model is the size of your data, both in terms of volume and dimensionality. Obviously, as these variables increase, so will the runtime of your model training, but it’s worth noting that there are models that will build relatively quickly – such as a Random Forest algorithm – and models that as your data gets larger and larger will become prohibitively slow – such as many neural network implementations. Make sure that you understand your data, your hardware resources, and your expectations of runtime before you start learning and working on a new training algorithm.
Concerns in Construction
When it comes to actually building your models, there’s nothing stopping you from just plugging your data right into your machine learning library of choice and going off to the races, but if you do, you may end up regretting it. It’s important to realize as you’re building the framework for your project that everything – from your number of cross-validation folds to aspects of your pre-processing to the type of model itself – is not only subject to change as you experiment, but also is highly likely to do so.
For that reason, it’s more critical than ever that you write modular, reusable code. You will be making changes. You will want to be able to pass a range of values to any given aspect of your code, such as the percentage of features to select from your raw data set. Make your life easier by starting the project with different pieces in different functions, and any values that may need to be updated as testing happens used as function parameters.
A similar concept applies to flow controls. It may be that you want to be able to turn on or off your feature selection functionality, or your class boosting, or switch quickly between different models. Rather than having to copy-paste or comment out large chunks of code, simply set up an area at the beginning of your scope with Boolean values to control the different aspects of your program. Then, it’ll be a simple change from True to False to enable or disable any particular part of your process.
I hope this post has given you some insights into things to think about before starting the construction of your machine learning project. There are many things to consider before any actual testing begins, and how you answer these questions and approach these problems can make that process either seamless or very frustrating. All it takes are a few good practices and a bit of information gathering for you to be well on your way to unlocking the knowledge hidden in your data.