Spark SQL (Spark) provides powerful data querying capabilities without the need for a traditional relational database. While AWS Elastic Map Reduce (EMR) service has made it easier to create and manage your own cluster, it can still be difficult to set up your data and configure your cluster properly so that you get the most out of your EC2 instance hours.

This blog will discuss some of the lessons that we learned when setting up our Spark. Hopefully we can spare you some of the time we experienced when setting ours up! While we believe these best practices should apply to the majority of users, bear in mind that each use case is different and you may need to make small adaptations to fit your project.



First Things First: Data Storage

Unsurprisingly, one of the most important questions when querying your data is how that data is stored. We learned that Parquet files perform significantly faster than CSV file format. Parquet is a columnar data storage format commonly used within the Hadoop ecosystem. Spark reads Parquet files much faster than text files due to its capability to perform low-level filtration and create more efficient execution plans. When running on 56 million rows of our data stored as Parquet instead of CSV, speed was dramatically increased while storage space was decreased (see the table below).

Storing Data as Parquet instead of CSV

Partitioning the Data

Onto the next step: partitioning the data within your data store. When reading from S3, Spark takes advantage of partitioning systems and requests only the data that is required for the query. Assuming that you can partition your data along a frequently used filter (such as year, month, date, etc.), you can significantly reduce the amount of time Spark must spend requesting your dataset. There is a cost associated with this, though – the data itself must be partitioned according to the same scheme within your S3 bucket. For example, if you’re partitioning by year, all of your data for 2017 must be located at s3://your-bucket/path/to/data/2017, and all of your data for 2016 must be located at s3://your-bucket/path/to/data/2016, and so on.

You can also form multiple hierarchical layers of partitioning this way (s3://…/2016/01/01 for January 1st, 2016). The downside is that you can’t partition your data in more than one way within one table/bucket and you have to physically partition the data along those lines within the actual filesystem.

Despite these small drawbacks, partitioning your data is worth the time and effort. Although our aggregation query test ran in relatively the same time on both partitioned and unpartitioned data, our filtration query ran 7x faster, and our table scan test ran 6x faster.

Caching the Data

Next, you should cache your data in Spark when you can fit it into the RAM of your cluster. When you run the command ‘CACHE TABLE tablename’ in Spark, it will attempt to load the entire table into the storage memory of the cluster. Why into the storage memory? Because Spark doesn’t have access to 100% of the memory on your machines (something you likely would not want anyways), and because Spark doesn’t use all of the accessible memory for storing cached data. Spark on Amazon EMR is only granted as much RAM as a YARN application, and then Spark reserves about one third of the RAM that it’s given for execution memory, (i.e. temporary storage used while performing reads, shuffles, etc.). As a rule of thumb, the actual cache storage capacity is up to 1/3 of the total memory in your RAM.

Furthermore, when Spark caches data, it doesn’t use the serialization options available to a regular Spark job. As a result, the size of the data actually increases by about two to three times when you load it into memory. Any overflow that doesn’t fit into Spark storage does get saved to disk, but as we will discuss later, this leads to worse performance than if you read the data directly from S3.


Configuring Spark – The Easy Way and the Hard (the Right) Way 

The Easy Way

Spark has a bevy of configuration options to control how it uses the resources on the machines in your cluster. It is important to spend time with the documentation to understand how Spark executors and resource allocation works if you want to get the best possible performance out of your cluster. However, when just starting out, Amazon EMR’s version of Spark allows you to use the configuration parameter “maximizeResourceAllocation” to do some of the work for you. Essentially, this configures your cluster to make one executor per worker node, with each executor having full access to all the resources (RAM, cores, etc.) on that node. To do this, simply add “maximizeResourceAllocation true” to your “/usr/lib/spark/conf/spark-defaults.conf” file.

While this may be an ‘acceptable’ way to configure your cluster, it’s rarely the best way. When it comes to cluster configuration, there are no hard and fast rules. How you want to set up your workers will come down to the hardware of your machines, the size and shape of your data, and what kind of queries you expect to run. That being said, there are some guidelines that you can follow.

 The Right Way: Guidelines

First, each executor should have between two and five cores. While it’s possible to jam as many cores into each executor as your machine can handle, developers found that you start to lose efficiency with a core count higher than five or six. Also, while it is possible to make an individual executor for each core, this causes you to lose the benefits of running multiple tasks inside a single JVM. For these reasons, its best to keep the number of cores per executor between two and five, making sure that your cluster as a whole uses all available cores. For example, on a cluster with six worker machines, each machine having eight cores, you would want to create twelve executors, each with four cores. The number of cores is set in spark-defaults.conf, while the number of executors is set at runtime by the “–num-exectors #” command line parameter.

Memory is the next thing you must consider when configuring Spark. When setting the executors’ memory, it’s important to first look at the YARN master’s maximum allocation, which will be displayed on your cluster’s web UI (more on that later). This number is determined by the instance size of your workers and limits the amount of memory each worker can allocate for YARN applications (like your Spark program).

Your executor memory allocation needs to split that amount of memory between your executors, leaving a portion of memory for the YARN manager (usually about 1GB), and remaining within the limit for the YARN container (about 93% of the total).

The following equation can be used for shorthand:

One Last Thing – the Spark Web UI

Earlier in this post we mentioned checking the YARN memory allocation through the Spark Web UI. The Spark Web UI is a browser-based portal for seeing statistics on your Spark application. This includes everything from the status of the executors, to the job/stage information, to storage details for the cluster. It’s an incredibly useful page for debugging and checking the effects of your different configurations. However, the instructions on Amazon’s page for how to connect are not very clear.

In order to connect, follow Amazon’s instructions – located here –  but in addition to opening the local port 8157 to a dynamic port, open a second local port to port 80 (to allow http connections), as well as a third local port to port 8088. This port – the one connected to your cluster’s port 8088 – will give you access the YARN resource manager, and through that page, the Spark Web UI.

Note: the Spark Web UI will only be accessible when your Spark application is running. You can always access the YARN manager, but it will show an empty list if there is no Spark application to display.


Now it’s Your Turn!

Unlocking the power of Spark SQL can seem like a daunting task, but performing optimizations doesn’t have to be a major headache. Focusing on data storage and resource allocation is a great starting point and will be applicable for any project. We hope that this post will be helpful to you in your SparkSQL development, and don’t forget to check out the Spark Web UI!


Hey there! This summer, dbSeer has been keeping pretty busy. We completed a database migration project with one of our customers, Subject7, and then turned it into a case study to share with our great supporters like you.

In the project, our certified AWS architects (and all-around awesome people) designed a new network architecture from the ground-up and moved 50 database instances to Amazon RDS. They did all this while still reducing Subect7’s costs by 45%. If that’s not amazing, tell me what is…I’m waiting.

We know you want to learn more, so you can see the full case study here.


If you’re short on time, check out below to see the project at a glance:

Who was the client?

Subject7, who created a no-coding, cloud-based automated testing solution for web and mobile applications.

What was the opportunity?

Subject 7 sought to enhance their back-end architecture with the most optimal resource allocation to prepare for future expansion.

What was dbSeer’s solution?

dbSeer designed a new network architecture from the ground-up, which included moving to Amazon RDS. Once on AWS, dbSeer found the most optimal resource allocation.

What were the results?

dbSeer migrated nearly 50 database instances to RDS with minimum downtime. Subject 7 is now able to scale the back-end server to any size without impact to users. AWS costs decreased by nearly 45% and Subjet7 achieved a positive ROI in only 2 months.


If you’re interested in learning more, or have specific questions, or just want to say hi, we always love connecting with our readers. Don’t hesitate to reach out, which you can do here.

Data Architecture

We’ve got some pretty exciting news to share. Earlier this year, we completed a Migration and Big Data project with one of our customers, the market leader in telecom interconnect business optimization. The project went so well that we made a case study to showcase what we did.

In a nutshell, we demonstrated how they could take advantage of the elasticity and scalability of AWS services to support market expansion. We did this by re-architecting their event processing engine and leveraging AWS elastic services and open source-technologies to provide unlimited scalability. As a result of our work, they increased their processing speeds by 60x at a fraction of the cost.

If you want more details on this project (and trust me, you do) check out this link (Delivering-on-the-AWS-Promise-Migration-Case-Study).

You can learn all you’ve ever wanted about the opportunity client presented, the solution we designed, and the results we earned.


Amazon RDS is a managed relational database service that provides multiple familiar database engines to choose from (Amazon Aurora, MySQL, MariaDB, Oracle, Microsoft SQL Server, and PostgreSQL). Amazon RDS handles routine database tasks such as provisioning, patching, backup, recovery, failure detection, and repair.

Compared to the hosted databases, RDS is easy to use and the admin effort is very low. Increasing the performance and storage is easy. Monitoring, daily backups and restores can be configured easily.

Existing hosted databases can easily be migrated to AWS using AWS Migration Service. This service supports homogeneous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL.


RDS Configuration

Above architecture diagram, shows a proposed AWS Architecture for an Enterprise Web Applications.



AWS recommends having your application inside a VPC (Virtual Private Cloud). For a multi-tiered web application, it is recommended to have a private and public subnet within the VPC. The database server should be launched in the private subnet, so that it is isolated and secure. The webservers should be launched in the public subnet. Security and routing needs to be configured so that only the web servers can communicate with the database servers in the private subnet. Since the web application is public facing, there would be an internet route configured for the public subnet.


Multi-AZ Deployment

AWS RDS allows multi-AZ deployment to support high availability and reliability.  With this feature, AWS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. AWS synchronously replicates the data from the primary to the secondary database instance. If the primary database instance goes down for any reason AWS will automatically fail over to the secondary database instance.


Read Replicas

Read Replicas, can help you scale out beyond the capacity of a single database deployment for read-heavy database workloads. Updates made to the source DB instance are asynchronously copied to the Read Replica. This mechanism would be very useful in case you have a web application and reporting application both using the same database instance. In this scenario, all read only traffic would be routed to the read replicas. The primary database would be used for read and write traffic for the web application.


Backup and Maintenance

AWS automatically creates backups of the RDS instance. Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance. To reduce performance impact, backups and maintenance should be configured when application usage is low.


Managing web traffic is a critical part of any web application, and load balancing is a common and efficient solution. Load balancing distributes workloads, aiming to maximize throughput, minimize response time and avoid overloading a single resource. Using auto-scaling in combination with load balancing allows for your system to grow and shrink its distributed resources as necessary, providing a seamless experience to the end user. The scalability and elasticity features of AWS can be easily utilized by any web application built on AWS.

Considerations on setting up a Logi App on AWS

Applications built using Logi have shared components such as the Cached Data Files, Bookmark Files. To provide the seamless experience to the end user, Logi needs to be configured correctly to share these common files across multiple instances as the system scales the resources.


To support the shared resources, a shared file system is needed which needs to be shared across the multiple instances as resources get allocated and deallocated. Amazon Elastic File System or EFS is the shared file system on AWS. Currently EFS is only supported on Linux based instances and is not supported on Windows.

To support auto scaling, shared file locations need to be defined in the application settings so that new resources are pre-configured when auto scaling allocates new resources. Whenever you add a new server to the auto-scaling group, it has to be pre-configured with both your Logi application and the correct connections to the distributed resources.

Recommendations with Logi Apps on AWS

The solution to the EFS challenge on Windows Logi Apps involves adding a middle layer of Linux based EC2 instances to the architecture. Mount the EFS volumes on the Linux Instances, and these can be accessed on the Windows servers via the SMB protocol. By adding these Linux instances – and an associated load balancer – it becomes possible to use an EFS volume despite the Windows operating system being unable to mount it directly.

Use of Amazon Machine Images (AMIs) allows the developer to create a single instance with the Logi web app, the server and user settings for accessing the EFS, and their specific Logi application. This AMI can then be used by the auto scaling group to allocate and deallocate new instances. With the setting and report definitions saved in the AMI, and the shared elements such as Bookmarks and data cache saved on the EFS, it becomes possible to use AWS load balancers to implement a distributed and scalable Logi web application.

A detailed, step-by-step guide outlining all the technical details on how a Logi application can be configured to harness the scalability & elasticity features of AWS can be found here.

Self Service

Self-service business intelligence (BI) is all the rage. If Google analytics is any indication, it wasn’t until January 2015 that the keyword “self-service BI” appeared consistently on its search radar. There are a lot of claims that point toward self-service BI being the magic bean for business users trying to make sense of huge data volumes.

What self-service can resolve (in a way):

Self ServiceThe stalk rising from this magic bean is a fast track highway that enables business users to help themselves to information they need without any dependency on more tech savvy folks, or IT departments. Tableau and Qlik are examples of companies who claim to fulfill this highly sought need. No more waiting for IT to pull reports, no more waiting on tech savvy developers and coders to decipher volumes of business intelligence data that you have accumulated–but have no clue how to digest. Now, you can just go in yourself and pull beautiful visualizations that turns terabytes of raw data into meaningful, and presentable, information any business person can digest.

What story is the data is telling? Self-service models fall short in answering this question.

What self-service mostly does not resolve:

As convenient as it may be, self-service BI is insufficient when on its own. Yes, it resolves the issue of dependency, but in what way? In one of our white papers on, we discuss the ideal framework for maturing your analytics platform. It speaks to two approaches: the descriptive versus the diagnostic approach to understanding analytics.

The diagnostic approach & its insufficiencies

Self-service BI can easily fulfill a diagnostic approach. (For more on the framework for understanding your analytics, see this paper). In the diagnostic approach, you can slice and dice the data on your own. However, the descriptive approach that answers the where, what, and hows of your data cannot be so easily fulfilled in this way. What is the story the data is telling? Self-service models fall short in answering this question. After all, there is an expertise data scientists have to manipulate and extract information from data that many end-users may not have. The opportunity to unveil and attain these findings is lost if and when business end-users purely rely on self-service offerings.

The Better Solution:

In no way am I suggesting we get rid of self-service BI. There is a well-established need for it and SaaS vendors in the BI world should definitely offer it. However, the better solution is to use self-service solutions as the exception, and not the rule for your business intelligence needs. There is a lot of business intelligence to be attained that self-service solutions are incapable of unearthing.

Jack set his eyes on what was not his for the taking when he climbed the magic beanstalk, the golden egg, the harp, the coins. I’m not claiming you can’t have them (!), I’m just saying you should get the goods through the appropriate means, rather than necessarily helping yourself in the dark!



Machine learning algorithms, while powerful, are not perfect, or at least not perfect “right out of the box.” The complex math that controls the building of machine learning models requires adjustments based on the nature of your data in order to give the best possible results. There are no hard and fast rules to this process of tuning, and frequently it will simply come down to the nature of your algorithm and your data set, although some guidelines can be provided to make the process easier. Furthermore, while tuning the parameters of your model is arguably the most important method of improving your model, other methods do exist, and can be critical for certain applications and data sets.

Tuning Your Model

As stated above, there are no hard and fast rules for tuning your model. Which parameters you can adjust will come down to your choice of algorithm and implementation. Generally, though, regardless of your parameters, it’s desirable to start with a wide range of values for your tuning tests, and then narrow your ranges after finding baseline values and performance. It’s also generally desirable, in your first ‘round’ of testing, to test more than one value at once, as the interactions between different parameters can mean that as you adjust one, it only becomes effective within a certain range of values for another. The result is that some kind of comprehensive testing is needed, at least at first. If you’re using the sci-kit learn library, I highly recommend using their GridSearchCV class to find your baseline values, as it will perform all of the iterations of your parameters for you, with a default of three fold cross-validation.

Once you have your baseline, you can begin testing your parameter values one at a time, and within smaller ranges. In this round of testing, it’s important to pay close attention to your evaluation metrics, as your parameter changes will, hopefully, affect the various performance features of your model. As you adjust, you will find patterns, relationships, and learn more about the way in which your individual algorithm and data respond to the changes that you’re making. The best advice I can give is to test extensively, listen to your metrics, and trust your intuitions.

The next best advice I can give is don’t be afraid to start over. It may be that after running several rounds of fine tuning that your performance plateaus. Let yourself try new baseline values. The machine learning algorithms out there today are very complex, and may respond to your testing and your data in ways that only become clear after many failures. Don’t let it discourage you, and don’t be nervous about re-treading old ground once you’ve learned a little more from your previous experiments. It may end up being the key you need to find a breakthrough.

Alternate Methods – Boosting

Aside from tuning your model’s parameters, there are many other ways to try to improve your model’s performance. I’m going to focus on just a few here, the first of them being the concept of boosting. Boosting is the artificial addition of records to your data set, usually to boost the presence of an under-represented or otherwise difficult to detect class. Boosting works by taking existing records of the class in question and either copying them or altering them slightly, and then adding some desired number of duplicates to the training data. It is important that this process happens only with the training data and not the testing data when using cross-validation, or you will add biased records to your testing set.

It is important also only to use boosting on your data set up to a certain degree, as altering the class composition too drastically (for example, by making what was initially a minority class into the majority class) can throw off the construction of your models as well. However, it is still a very useful tool when handling a difficult to detect target class, imbalanced class distribution, or when using a decision tree model, which generally works best with balanced class distributions.

Alternate Methods – Bagging

This method of model improvement isn’t actually a way to improve your models, per se, but instead to improve the quality of your system. Bagging is a technique that can be used with ensemble methods, classification systems that employ a large number of weak classifiers to build a powerful classification system. The process of bagging involves either weighting or removing classifiers from your ensemble set based on their performance. As you decrease the impact of the poorly performing classifiers (“bagging” them), you increase the overall performance of your whole system. Bagging is not a method that can always be used, but it is an effective tool for getting the absolute most out of using an ensemble classification method.

Alternate Methods – Advanced Cleaning

As discussed in our pre-processing post, advanced pre-processing methods, such as feature selection and outlier removal, can also increase the effectiveness of your models.  Of these two methods, feature selection is frequently the easier and more effective tool, although if your data set is particularly noisy, removing outliers may be more helpful to you. Rather than reiterate what was said before, I recommend that you read our full post on these techniques, located here.


The process of model improvement is a slow but satisfying one. Proposing, running, and analyzing your experiments takes time and focus, but yields the greatest rewards, as you begin to see your models take shape, improve, and reveal the information hidden in your data.



How do we know when a model has learned? The theoretical examinations of this question go both wide and deep, but as a practical matter, what becomes important for the programmer is the ability of a classification model to make accurate distinctions between the target classes. However, making accurate distinctions is not always the same as having a highly accurate set of classifications. If that statement doesn’t make a ton of sense to you, allow me to provide you an example:

You have a data set that is composed of classes A and B, with 90% of records being of class A, and 10% being class B. When you provide this data to your model, it shows 90% accuracy, and you take this to be a good result, until you dig a little deeper into the process and find out that the classifier had given every single record a class A label. This means that even though the model completely failed to distinguish between the two classes, it still classified 90% of records accurately, because of the nature of the data set.

These sorts of examples are more common than you’d think, and they’re why we use a variety of different evaluation metrics when trying to comprehend the effectiveness of our models. In this post, we’ll go over a few of these metrics, as well as how they’re calculated, and how you can apply them both within and across classes.

Our Base Values

The first thing we have to do is create some more robust ways of defining our model’s actions than simply ‘correct classification’ and ‘incorrect classification’. To that end, the following values are calculated for each class as the model runs through the testing data:

  • True Positives (TP): The classifier applies label X, and the record was of class X.
  • False Positives (FP): The classifier applies label X, and the record was not of class X.
  • True Negatives (TN): The classifier applies any label that is not X, and the record was not of class X.
  • False Negatives (FN): The classifier applies any label that was not X, and the record was of class X.

As I said, these values are calculated for each class in your problem, so if, for example, a record is classified as class A, and its actual label was for class B, that would be +1 to Class A’s False Positives, and +1 to class B’s False Negatives. If you have a multi-class dataset, the same rules apply. In that example, if you had a class C as well, you would also add +1 to class C’s True Negatives, as the record was accurately not classified as belonging to C.

Useful Metrics

These four values allow us to get a much more detailed picture of how our classifier is performing. It is still possible to get an accuracy score for each class, by adding all the True Positives and True Negatives and dividing by the total number of records. However, you can also calculate many other metrics with these values. For the purposes of this post, we’re going to focus on just three: precision, recall, and F1 measure.

Precision (TP / (TP + FP)) is a metric that shows how frequently your classifier is correct when it chooses a specific class. The numerator – True Positives – is the number of records correctly classified as the given class, and the denominator – True Positives plus False Positives – is the number of times your classifier assigned that class label, whether correct or incorrect. With this metric, you can see how frequently your model is misclassifying a record by assigning it this particular class. A lower precision value shows that the model is not discerning enough in assigning this class label.

Recall (TP / (TP + FN)) is a metric that shows how frequently your classifier labels a record of the given class correctly. The numerator – True Positives – is the number of records correctly classified as the given class, and the denominator – True Positives plus False Negatives – is the number of records that should have been classified as the given class. With this metric, you can see what percentage of the target records your classifier is able to correctly identify. A lower recall value shows that the model is not sensitive enough to the target class, and that many records are being left out of the classification.

Finally, F1 measure (2 * ( (recall * precision) / (recall + precision))) is a combined score of recall and precision that gives a single measurement for how effective your classifier is. F1 score is most useful when trying to determine if a tradeoff of recall or precision for the other is increasing the general effectiveness of your model. You should not use F1 score as your only metric for model evaluation. Delving into your model’s specific precision and recall will give you a better idea of what about your model actually needs improving.

Micro/Macro Metrics

If your problem is such that you only care about a single target class, then it’s easy to stop at the evaluation of your model as above. However, for multi-class problems, it’s important to have a calculation to show the model’s general effectiveness across all classes, as opposed to each class individually. There are two ways to do this, both with their advantages and disadvantages.

The first is known as macro-averaging, which computes each metric for each class first, and then takes an average of those values. For example, if you have three classes, with precision 0.6, 0.7, and 0.2, you would add those values up to 1.5 and divide by 3 to get a macro-precision of 0.5.

Micro-averaging on the other hand takes all the values that would go into each individual metric and then calculates a single value based on those values. This can be a little confusing, so allow me to provide an example. For consistency’s sake, let’s use score values that yield the same precision values as above: your data could have class A with TP = 6, FP = 4; class B with TP = 3, FP = 7; and class C with TP = 20, FP = 100. This would give you the 0.6, 0.7, and 0.2 precision as above, but performing a micro-averaging, which means adding all the individual values for each class, though it were one class, (all TP / (all TP + all FP)) you get a micro-precision of 0.261.

This is much lower than the 0.5 macro-precision, but this example should not bias you away from one metric or towards another. There are times when either metric might give you more insight into the effectiveness of your classifier, and so you must use your judgment when choosing what metrics to pay attention to.


Building a complete picture of your model’s effectiveness takes more than just looking at the number of misclassified records, and we should be glad of that. As you delve into the various metrics available to you as a data scientist, you can begin to see patterns forming, and use those experiments and your intuitions to build better and more powerful models through the process of tuning, which we will cover in our next blog post.



One of the things you learn quickly while working in machine learning is that no “one size fits all.” When it comes to choosing a model, this is truer than ever. However, there are some guidelines one can follow when trying to decide on an algorithm to use, and while we’re on the subject of model creation, it’s useful to discuss good practices and ways in which you can fall into unforeseen mistakes.


Before we dig into the topic of model selection, I want to take a moment to address an important idea you should be considering when designing the structure of your project: cross-validation. Cross-validation is essentially a model evaluation process that allows you to check the effectiveness of your model on more than one data set by building multiple training sets from your existing data. This is done by moving a testing data ‘fold’ through your data, building models with what isn’t set aside for training.

For example: let’s say you have 100 records, and want to use 10 ‘fold’ cross-validation, essentially building 10 distinct models. For the first model, you might use the first 10 records for your testing data, and then the next 90 records as training data. Once you’ve built a model with that training data and tested it on that testing data, you move the testing fold down to the 11th-20th records, and use the 1st-10th and 21st-100th records combined together as your training data. This process repeats, moving the testing fold through the dataset, until you have 10 distinct models and 10 distince results, which will give you a more robust picture of how well your process has learned than if you just built one model with all your data.

Cross-validation is mostly a straightforward process, but there are a couple of things to watch out for while you’re performing it. The first possible issue is the introduction of bias into your testing data. You have to be careful with the data transformations that you perform while using cross-validation; for example, if you’re boosting the presence of a target class by adding new rows of that class’ data, you have to make sure that the boosting occurs after you perform the testing-training split. Otherwise, the results of your testing will appear to be better than they would be otherwise, since your training data will be tainted with new, targeted data.

Another thing to consider is whether to use stratified cross-validation or not. Stratified cross-validation is an enhancement to the cross-validation process where instead of using the arbitrary order of your data set or a random sampling function to build your testing and training data, you sample from your whole data set equal to a similar proportional representation of each class as is in your whole data set. For example, if your data is 75% class A and 25% class B, stratified cross-validation would attempt to make testing and training samples that maintain that balance of classes. This has the benefit of more accurately depicting the nature of your original problem than using a random or arbitrary system.

Concerns in Selection

The major topic to think about when deciding what machine learning model to use is the shape and nature of your data. Some of the high level questions you might ask yourself: Is this a multi-class or binary class problem? (Note, if you only care about a single target class within a multi-class dataset, it’s possible to treat it as a binary class problem.)  What is the distribution of classes? If the distribution is highly uneven, you may want to avoid certain types of models, such as decision tree based models, or consider boosting the presence of the under-represented class.

Another major question is whether your data is linearly separable or not. While it’s rare that the complicated datasets you will encounter are truly ‘linearly’ separable, datasets that contain more clearly defined class boundaries are good candidates for models such as support vector machines. That being said, it can be difficult to get a good picture of your dataset if it has a high number of features (also known as high dimensionality). In this case, there are still ways to map your dataset in a 2D plane, and it can be highly useful to do so as an initial step before model selection, in order to give you new insights into your data set. Rather than detail the approaches here, here is a link to a post by Andrew Norton which details how you can use the matplotlib library in python to visualize multi-dimensional data.

One of the final considerations that you have to make when selecting your model is the size of your data, both in terms of volume and dimensionality. Obviously, as these variables increase, so will the runtime of your model training, but it’s worth noting that there are models that will build relatively quickly – such as a Random Forest algorithm – and models that as your data gets larger and larger will become prohibitively slow – such as many neural network implementations. Make sure that you understand your data, your hardware resources, and your expectations of runtime before you start learning and working on a new training algorithm.

Concerns in Construction

When it comes to actually building your models, there’s nothing stopping you from just plugging your data right into your machine learning library of choice and going off to the races, but if you do, you may end up regretting it. It’s important to realize as you’re building the framework for your project that everything – from your number of cross-validation folds to aspects of your pre-processing to the type of model itself – is not only subject to change as you experiment, but also is highly likely to do so.

For that reason, it’s more critical than ever that you write modular, reusable code. You will be making changes. You will want to be able to pass a range of values to any given aspect of your code, such as the percentage of features to select from your raw data set. Make your life easier by starting the project with different pieces in different functions, and any values that may need to be updated as testing happens used as function parameters.

A similar concept applies to flow controls. It may be that you want to be able to turn on or off your feature selection functionality, or your class boosting, or switch quickly between different models. Rather than having to copy-paste or comment out large chunks of code, simply set up an area at the beginning of your scope with Boolean values to control the different aspects of your program. Then, it’ll be a simple change from True to False to enable or disable any particular part of your process.


I hope this post has given you some insights into things to think about before starting the construction of your machine learning project. There are many things to consider before any actual testing begins, and how you answer these questions and approach these problems can make that process either seamless or very frustrating. All it takes are a few good practices and a bit of information gathering for you to be well on your way to unlocking the knowledge hidden in your data.



One of the things you realize quickly going from guides, classes, and tutorials into hands-on machine learning projects is that real data is messy. There’s a lot of work to do before you even start considering models, performance, or output. Machine learning programs follow the “garbage in, garbage out” principle; if your data isn’t any good, your models won’t be either. This doesn’t mean that you’re looking to make your data pristine, however. The goal of pre-processing isn’t to support your hypothesis, but instead to support your experimentation. In this post, we’ll examine the steps that are most commonly needed to clean up your data, and how to perform them to make genuine improvements in your model’s learning potential.

Handling Missing Values

The most obvious form of pre-processing is the replacement of missing values. Frequently in your data, you’ll find that there are missing numbers, usually in the form of a NaN flag or a null. This could have been because the question was left blank on your survey, or there was a data entry issue, or any number of different reasons. The why isn’t important; what is important is what you’re going to do about it now.

I’m sure you’ll get tired of hearing me say this, but there’s no one right answer to this problem. One approach is to take the mean value of that row. This has the benefit of creating a relatively low impact on the distinctiveness of that feature. But what if the values of that feature are significant? Or have a wide enough range, and polarized enough values, that the average is a poor substitute for what would have been the actual data? Well, another approach might be to use the mode of the row. There’s an argument for the most common value being the most likely for that record. And yet, you’re now diminishing the distinctiveness of that answer for the rest of the dataset.

What about replacing the missing values with 0? This is also a reasonable approach. You aren’t introducing any ‘’new” data to the data set, and you’re making an implicit argument within the data that a missing value should be given some specific weighting. But that weighting could be too strong with respect to the other features, and could cause those rows to be ignored by the classifier. Perhaps the most ‘pure’ approach would be to remove any rows that have any missing values at all. This too is an acceptable answer to the missing values problem, and is also one that maintains the integrity of the data set, but it is frequently not an option depending on how much data you have to give up with this removal.

As you can see, each approach has its own argument for and against, and will impact the data in its own specific way. Take your time to consider what you know about your data before choosing a NaN replacement, and don’t be afraid to experiment with multiple approaches.

Normalization and Scaling

As we discussed in the above section with the 0 case, not all numbers were created equal. When discussing numerical values within machine learning, people often refer to numerical values instead as “continuous values”. This is because numerical values can be treated as having a magnitude and distance from each other (ie, 5 is 3 away from 2, is more than double the magnitude of 2, etc). The importance of this lies in the math of any sort of linearly based or vector based algorithm. When there’s a significant difference between two values mathematically, it creates more distance between the two records in the calculations of the models.

As a result, it is vitally important to perform some kind of scaling when using these algorithms, or else you can end up with poorly “thought out” results. For example: one feature of a data set might be number of cars a household owns (reasonable values: 0-3), while another feature in the data set might be the yearly income of that household (reasonable values: in the thousands). It would be a mistake to think that the pure magnitude of the second feature makes it thousands of times more important than the former, and yet, that is exactly what your algorithm will do without scaling.

There are a number of different approaches you can take to scaling your numerical (from here on out, continuous) values. One of the most intuitive is that of min-max scaling. Min-max scaling allows you to set a minimum and maximum value that you would like all of your continuous values to be between (commonly 0 and 1) and to scale them within that range. There’s more than one formula for achieving this, but one example is:

X’ = ( ( (X – old_min) / (old_max – old _min) ) * (new_max – new_min) ) + new_min

Where X’ is your result, X is the value in that row, and the old_max/min are the minimum and maximum of the existing data.

But what if you don’t know what minimum and maximum values you want to set on your data? In that case, it can be beneficial to use z-score scaling. Z-score scaling is a scaling formula that gives your data a mean of 0 and a standard deviation of 1. This is the most common form of scaling for machine learning applications, and unless you have a specific reason to use something else, it’s highly recommended that you start with z-score.

The formula for z-score scaling is as follows:

X’ = (X – mean) / standard_deviation

Once your data has been z-score scaled, it can even be useful to ‘normalize’ it by applying min-max scaling on the range 0-1, if your application is particularly interested in or sensitive to decimal values.

Categorical Encoding

We’ve focused entirely on continuous values up until now, but what about non-continuous values? Well, for most machine learning libraries, you’ll need to convert your string based or “categorical” data into some kind of numerical representation. How you create that representation is the process of categorical encoding, and there are – again – several different options for how to perform it.

The most intuitive is a one-to-one encoding, where each categorical value is assigned with and replaced by an integer value. This has the benefit of being easy to understand by a human, but runs into issues when being understood by a computer. For example: Let’s say we’re encoding labels for car companies. We assign 1 to Ford, 2 to Chrysler, 3 to Toyota, and so on. For some algorithms, this approach would be fine, but for any that involve distance computations, Toyota now has three times the magnitude that Ford does. This is not ideal, and will likely lead to issues with your models.

Instead, it could be useful to try to come up with a binary encoding, where certain values can be assigned to 0 and certain values can be assigned to 1. An example might be engine types, where you only care if the engine is gas powered or electric. This grouping allows for a simple binary encoding. If you can’t group your categorical values however, it might be useful to use what’s called ‘one-hot encoding’. This type of encoding converts every possible value for a feature into its own new feature. For example: the feature “fav_color” with answers “blue”, “red”, and “green”, would become three features, “fav_color_blue”, “fav_color_green”, and “fav_color_red”. For each of those new features, a record is given a 0 or a 1, depending on what their original response was.

One-hot encoding has the benefits of maintaining the most possible information about your dataset, while not introducing any continuous value confusion. However, it also drastically increases the number of features your dataset contains, often with a high cost to density. You might go from 120 categorical features with on average 4-5 answers each, to now 480-600 features, each containing a significant number of 0s. This should not dissuade you from using one-hot encoding, but it is a meaningful consideration, particularly as we go into our next section.

Feature Selection

Another way in which your data can be messy is noise. Noise, in a very general sense, is any extraneous data that is either meaningless or confuses your model. As the number of features in your model increases, it can actually become harder to distinguish between classes. For this reason, it’s sometimes important to apply feature selection algorithms to your dataset to find the features that will provide you with the best models.

Feature selection is a particularly tricky problem. At its most simple, one can just remove any features that contain a single value for all records, as they will add no new information to the model. After that, it becomes a question of calculating the level of mutual information and/or independence between the features of your data. There are many different ways to do this, and the statistical underpinnings are too dense to get into in the context of this blog post, but several machine learning libraries will implement these functions for you, making the process a little easier.

Outlier Removal

The other main form of noise in your data comes from outliers. Rather than a feature causing noise across multiple rows, an outlier occurs when a particular row has values that are far outside the “expected” values of the model. Your model, of course, tries to include these records, and by doing so pulls itself further away from a good generalization. Outlier detection is its own entire area of machine learning, but for our purposes, we’re just going to discuss trying to remove outliers as part of pre-processing for classification.

The simplest way to remove outliers is to just look at your data. If you’re able to, you certainly can by hand select values for each feature that are beyond the scope of the rest of the values. If there are too many records, too many features, or you want to remain blind and unbiased to your dataset, you can also use clustering algorithms to determine outliers in your data set and remove them. This is arguably the most effective form of outlier removal, but can be time consuming as you now have to build models in order to clean your data, in order to build your models.


Pre-processing may sound like a straightforward process, but once you get into the details it’s easy to see its importance in the machine learning process. Whether its preparing your data to go into the models, or trying to help the models along by cleaning up the noise, pre-processing requires your attention, and should always be the first step you take towards unlocking the information in your data.