How to Structure Your Machine Learning Projects – Part 3 - Model Testing & Evaluation

Overview

How do we know when a model has learned? The theoretical examinations of this question go both wide and deep, but as a practical matter, what becomes important for the programmer is the ability of a classification model to make accurate distinctions between the target classes. However, making accurate distinctions is not always the same as producing a highly accurate set of classifications. If that statement doesn’t make a ton of sense to you, allow me to give you an example:

You have a data set composed of classes A and B, with 90% of records belonging to class A and 10% to class B. When you provide this data to your model, it shows 90% accuracy, and you take this to be a good result until you dig a little deeper and find that the classifier gave every single record a class A label. This means that even though the model completely failed to distinguish between the two classes, it still classified 90% of records accurately, because of the nature of the data set.
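
The scenario above can be reproduced in a few lines of plain Python. This is only a sketch: the 100-record data set and the "classifier" that always answers A are made up for illustration.

```python
# Hypothetical data set: 90 records of class A and 10 of class B.
actual = ["A"] * 90 + ["B"] * 10

# A degenerate "classifier" that labels every record as class A.
predicted = ["A"] * len(actual)

# Accuracy: fraction of records whose predicted label matches the actual one.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# How many class B records did the classifier actually find?
b_found = sum(1 for a, p in zip(actual, predicted) if a == "B" and p == "B")

print(f"accuracy: {accuracy:.2f}")
print(f"class B records identified: {b_found} of {actual.count('B')}")
```

The accuracy figure looks healthy, yet not a single class B record is identified.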

These sorts of examples are more common than you’d think, and they’re why we use a variety of different evaluation metrics when trying to comprehend the effectiveness of our models. In this post, we’ll go over a few of these metrics, as well as how they’re calculated, and how you can apply them both within and across classes.

Our Base Values

The first thing we have to do is create some more robust ways of defining our model’s actions than simply ‘correct classification’ and ‘incorrect classification’. To that end, the following values are calculated for each class as the model runs through the testing data:

True Positives (TP): The classifier applies label X, and the record is of class X.
False Positives (FP): The classifier applies label X, and the record is not of class X.
True Negatives (TN): The classifier applies any label that is not X, and the record is not of class X.
False Negatives (FN): The classifier applies any label that is not X, and the record is of class X.

As I said, these values are calculated for each class in your problem, so if, for example, a record is classified as class A, and its actual class is B, that would be +1 to class A’s False Positives, and +1 to class B’s False Negatives. If you have a multi-class data set, the same rules apply. In that example, if you had a class C as well, you would also add +1 to class C’s True Negatives, as the record was correctly not classified as belonging to C.
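
As a rough sketch of that bookkeeping, the function below tallies the four base values for every class from a list of actual and predicted labels. The name base_values and its signature are hypothetical, chosen for this example rather than taken from any library.

```python
def base_values(actual, predicted, classes):
    """Count TP, FP, TN, and FN for every class, treating each class as X in turn."""
    counts = {c: {"TP": 0, "FP": 0, "TN": 0, "FN": 0} for c in classes}
    for a, p in zip(actual, predicted):
        for c in classes:
            if p == c and a == c:
                counts[c]["TP"] += 1   # labeled X, was X
            elif p == c and a != c:
                counts[c]["FP"] += 1   # labeled X, was not X
            elif p != c and a == c:
                counts[c]["FN"] += 1   # labeled something else, was X
            else:
                counts[c]["TN"] += 1   # labeled something else, was not X
    return counts

# The single record from the example: classified as A, actually B.
counts = base_values(actual=["B"], predicted=["A"], classes=["A", "B", "C"])
for name, values in counts.items():
    print(name, values)
```

Running this on the single example record gives class A one False Positive, class B one False Negative, and class C one True Negative.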

Useful Metrics

These four values allow us to get a much more detailed picture of how our classifier is performing. It is still possible to get an accuracy score for each class, by adding all the True Positives and True Negatives and dividing by the total number of records. However, you can also calculate many other metrics with these values. For the purposes of this post, we’re going to focus on just three: precision, recall, and F1 measure.

Precision (TP / (TP + FP)) is a metric that shows how frequently your classifier is correct when it chooses a specific class. The numerator – True Positives – is the number of records correctly classified as the given class, and the denominator – True Positives plus False Positives – is the number of times your classifier assigned that class label, whether correct or incorrect. With this metric, you can see how frequently your model is misclassifying a record by assigning it this particular class. A lower precision value shows that the model is not discerning enough in assigning this class label.

Recall (TP / (TP + FN)) is a metric that shows how frequently your classifier labels a record of the given class correctly. The numerator – True Positives – is the number of records correctly classified as the given class, and the denominator – True Positives plus False Negatives – is the number of records that should have been classified as the given class. With this metric, you can see what percentage of the target records your classifier is able to correctly identify. A lower recall value shows that the model is not sensitive enough to the target class, and that many records are being left out of the classification.
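
Both formulas translate directly into code. The sketch below uses made-up counts for a single class; returning 0.0 when the denominator is zero is an assumption of this example, not a rule from the post.

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns 0.0 if the label was never assigned (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN). Returns 0.0 if the class has no records (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts for one class.
tp, fp, fn = 8, 2, 8

p = precision(tp, fp)  # 8 of the 10 records given this label really belong to it
r = recall(tp, fn)     # 8 of the 16 records of this class were found

print(f"precision={p:.2f} recall={r:.2f}")
```

With these counts the classifier is usually right when it assigns the label, but it misses half of the records that should have received it.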

Finally, F1 measure (2 * ((recall * precision) / (recall + precision))) is the harmonic mean of recall and precision, giving a single measurement of how effective your classifier is. F1 measure is most useful when trying to determine whether trading recall for precision, or precision for recall, is increasing the general effectiveness of your model. You should not use F1 measure as your only metric for model evaluation. Delving into your model’s specific precision and recall will give you a better idea of what about your model actually needs improving.
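
To see how F1 behaves when one of the two metrics is traded for the other, here is a small sketch comparing two made-up classifiers whose precision and recall have the same simple average. The zero-denominator handling is an assumption of this example.

```python
def f1(precision, recall):
    """2 * (precision * recall) / (precision + recall), or 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# Two hypothetical classifiers; both pairs average out to 0.5.
balanced = f1(0.5, 0.5)
lopsided = f1(0.9, 0.1)

print(f"balanced: {balanced:.2f}")
print(f"lopsided: {lopsided:.2f}")
```

The lopsided classifier scores far lower, which is why F1 is a handy check that a gain in one metric has not cost too much of the other.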

Micro/Macro Metrics

If your problem is such that you only care about a single target class, then it’s easy to stop at the evaluation of your model as above. However, for multi-class problems, it’s important to have a calculation that shows the model’s general effectiveness across all classes, as opposed to each class individually. There are two ways to do this, each with its own advantages and disadvantages.

The first is known as macro-averaging, which computes each metric for each class first, and then takes an average of those values. For example, if you have three classes, with precision 0.6, 0.7, and 0.2, you would add those values up to 1.5 and divide by 3 to get a macro-precision of 0.5.
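
In code, macro-averaging is just the mean of the per-class values. A minimal sketch using the three precision values from the example:

```python
# Per-class precision values from the example above.
per_class_precision = {"A": 0.6, "B": 0.7, "C": 0.2}

# Macro-average: every class counts equally, regardless of its size.
macro_precision = sum(per_class_precision.values()) / len(per_class_precision)

print(f"macro-precision: {macro_precision:.2f}")
```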

Micro-averaging, on the other hand, pools the raw counts that go into a metric across all classes and then calculates a single value from the pooled totals. This can be a little confusing, so allow me to provide an example. For consistency’s sake, let’s use counts that yield the same precision values as above: your data could have class A with TP = 6, FP = 4; class B with TP = 7, FP = 3; and class C with TP = 20, FP = 80. This gives you the 0.6, 0.7, and 0.2 precision as above, but performing micro-averaging, which means adding up the individual counts for every class as though they were one class (all TP / (all TP + all FP)), you get a micro-precision of 33 / 120 = 0.275.
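
The pooling step can be sketched as follows. The counts are illustrative values chosen so that the per-class precisions come out to 0.6, 0.7, and 0.2.

```python
# Illustrative TP/FP counts per class (per-class precision: 0.6, 0.7, 0.2).
counts = {
    "A": {"TP": 6, "FP": 4},
    "B": {"TP": 7, "FP": 3},
    "C": {"TP": 20, "FP": 80},
}

# Per-class precision, for reference.
per_class = {c: v["TP"] / (v["TP"] + v["FP"]) for c, v in counts.items()}

# Micro-average: pool the counts first, then apply the formula once.
total_tp = sum(v["TP"] for v in counts.values())
total_fp = sum(v["FP"] for v in counts.values())
micro_precision = total_tp / (total_tp + total_fp)

print(per_class)
print(f"micro-precision: {micro_precision:.3f}")
```

Because class C contributes 100 of the 120 predictions, its low precision dominates the pooled result.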

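This is much lower than the 0.5 macro-precision, but this example should not bias you away from one metric or towards another. The gap comes from weighting: macro-averaging counts every class equally, while micro-averaging counts every prediction equally, so class C, which accounts for most of the predictions, pulls the micro value down toward its own low precision. There are times when either metric might give you more insight into the effectiveness of your classifier, and so you must use your judgment when choosing which metrics to pay attention to.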

Conclusion

Building a complete picture of your model’s effectiveness takes more than just looking at the number of misclassified records, and we should be glad of that. As you delve into the various metrics available to you as a data scientist, you can begin to see patterns forming, and use those patterns and your intuitions to build better and more powerful models through the process of tuning, which we will cover in our next blog post.
