AWS SageMaker is a machine learning platform for data scientists to build, train, and deploy predictive ML models. Unlike the traditional machine learning process, SageMaker allows data scientists to hop into the driver’s seat on projects and complete all three steps independently.
To show you how to use SageMaker yourself, we decided to use the platform to create a machine learning model of our own. Using publicly available airline data, our goal was to create a model that predicts whether or not a flight will be delayed. In this blog post, we walk you through, from start to finish, how we created our machine learning model using SageMaker. If you would like to follow along with us, click HERE to access an HTML copy of our notebook. Let’s get started!
The Setup
To start creating resources, we first went to the AWS website and logged into the console. Before we can use SageMaker, we need to either have an S3 bucket ready to use or create a new bucket for our notebook instance. Whichever S3 bucket you decide to use, it MUST be in the same region as your notebook instance.
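If you prefer to create the bucket programmatically rather than through the console, here is a minimal sketch using boto3; the region and bucket name are placeholders for your own values, not the ones we used.

```python
# Minimal sketch of creating an S3 bucket in a chosen region with boto3
import boto3

region = "us-west-2"                        # must match your notebook instance's region
bucket_name = "my-sagemaker-flight-data"    # hypothetical, globally unique bucket name

s3 = boto3.client("s3", region_name=region)
s3.create_bucket(
    Bucket=bucket_name,
    # For us-east-1, omit CreateBucketConfiguration entirely
    CreateBucketConfiguration={"LocationConstraint": region},
)
```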
Once we have an S3 bucket ready to use, we can head over to SageMaker within the AWS console to set up our notebook. From the SageMaker dashboard, click on the button that reads “Notebook Instances.” This will direct you to where you can either access a pre-existing notebook or create a new one. To create a new notebook instance, follow this procedure:
- Select “Create notebook instance”
- Give your notebook a name
- Select the instance type
- You need to ensure that your instance is big enough to support your workload. Typically, we will use an ml.t2.medium instance, which has 4 GiB of memory, 2 vCPU, and low network performance. For our data set, we used an ml.m5.xlarge instance which has 16 GiB of memory, 4 vCPU, and high network performance.
- If you would like to add an Elastic Inference accelerator, select which type you would like to add here; note that this is not required
- Choose your IAM role
- Select “Create a new role”
- Select “Specific bucket” → type in the name of the specific S3 bucket you would like to use
- Select “Network”
- Select “VPC” and select the default option from the drop down menu
- Select a subnet
- Select the default security group from the drop down menu
- Select “Create notebook instance” at the bottom of the page
From here you will be directed to the Jupyter Notebook server:
- If you need to upload your data from your drive, select “Upload” in the top right corner, and then select the file you wish to upload into the server
- To create your notebook, select “New” in the top right hand corner, and then select your preferred notebook type from the drop down menu. SageMaker supports both Python and R, and numerous environments. For our example, we decided to use “conda_python3.” From here you will be directed into your notebook resource where you can begin creating your model.
Step One: Preparing the Data
Now that our notebook is created and ready to use, we can begin building our model. Before we get started, let us note that we have decided to use the “linear learner” algorithm, which will dictate how we approach certain steps. If you decide to use a different algorithm, check out the SageMaker documentation for what needs to be done. Here is how we went through the building step using SageMaker:
- Import all of the necessary libraries and modules; to see which ones we used, check out the HTML file that contains all of our code.
- Read your CSV as a pandas DataFrame
- Clean the data
- All data cleaning is subjective and up to you
- Format the data according to the model’s requirements. In the case of the linear learner, we must do the following (see the sketch after this list):
- The data must be transformed into the format in which it will be fed to the training algorithm
- This information can be found in the SageMaker documentation
- For the linear learner, we transformed the data into matrices
- All features need to be numeric
- In the case of a binary classifier, all Booleans must be expressed as 0 and 1. In the case of a multiclass classifier, the label must be between 0 and the number of classes minus 1.
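To make these formatting steps concrete, here is a minimal sketch assuming a cleaned pandas DataFrame with a Boolean “delayed” column as the label; the file and column names are placeholders rather than the exact ones from our notebook.

```python
# Minimal sketch of preparing data for linear learner (names are placeholders)
import numpy as np
import pandas as pd

df = pd.read_csv("flight_data.csv")   # hypothetical file name

# ... data cleaning (dropping columns, handling missing values, etc.) goes here ...

# For a binary classifier the label must be 0 or 1, and the features
# must form a numeric matrix
labels = df["delayed"].astype("float32").to_numpy()
features = df.drop(columns=["delayed"]).astype("float32").to_numpy()

print(features.shape, labels.shape)
```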
Step Two: Training the Model
- First we need to decide which algorithm we are going to use. For our model, we used linear learner. A full list of the readily available built-in algorithms is provided in the SageMaker documentation.
- Split up the data into training, testing, and validation data. We split the data in the following proportions:
- 80% used for training
- 10% used for testing
- 10% used for validation
- The splitting of the data is subjective and depends on the use case and other factors
- Create three separate channels for the training, testing, and validation data respectively
- Start a training instance using your algorithm of choice. This creates a container with the algorithm, which in our case was linear learner, implemented and ready to use (a code sketch of the split, channel, and training steps follows this list).
- Run the training job
- This will take a bit of time; our training job took approximately 1600 seconds, which equates to about 27 minutes. The length of the training job largely depends on the size of your data and your algorithm.
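Here is a rough sketch of these training steps with the SageMaker Python SDK (v2), continuing from the `features` and `labels` arrays above. The bucket, prefix, instance type, and hyperparameter values are our own placeholders, not necessarily the exact settings from our notebook.

```python
# Rough sketch: split the data, upload CSV channels to S3, and run a
# linear learner training job (placeholder names and settings throughout)
import io
import boto3
import numpy as np
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()          # or the bucket created during setup
prefix = "flight-delay-linear-learner"     # hypothetical S3 prefix

# 80 / 10 / 10 split
n = len(labels)
idx = np.random.permutation(n)
train_i, val_i, test_i = np.split(idx, [int(0.8 * n), int(0.9 * n)])

def upload_csv(x, y, name):
    """Upload a label-first CSV (the layout linear learner expects for text/csv) to S3."""
    buf = io.StringIO()
    np.savetxt(buf, np.column_stack([y, x]), delimiter=",", fmt="%g")
    key = f"{prefix}/{name}/{name}.csv"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
    return TrainingInput(f"s3://{bucket}/{key}", content_type="text/csv")

# One channel each for training, validation, and testing
channels = {
    "train": upload_csv(features[train_i], labels[train_i], "train"),
    "validation": upload_csv(features[val_i], labels[val_i], "validation"),
    "test": upload_csv(features[test_i], labels[test_i], "test"),
}

# Pull the built-in linear learner container and run the training job
container = image_uris.retrieve("linear-learner", session.boto_region_name)
estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    feature_dim=features.shape[1],
    predictor_type="binary_classifier",
    mini_batch_size=200,
)
estimator.fit(channels)   # blocks until the training job finishes
```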
Once the training job is complete, you are ready to deploy and evaluate your model to see how well it performs.
Step Three: Deploying and Evaluating the Model
- Deploy the model with one line of code; this returns an endpoint that the model can now be invoked through to make predictions.
- Once the model is deployed, create a function and call it to evaluate the accuracy of the model (see the sketch after this list)
- We found that our model runs with 76.72% accuracy, which is decent.
- We found this metric simply by dividing the number of correct predictions by the total number of predictions made on the test data. Therefore, we can say that our model accurately predicted whether there was a flight delay 76.72% of the time.
- Make sure you terminate the endpoint when you are done using it so you are not billed for resources not being used.
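Here is a rough sketch of the deployment and evaluation steps, continuing from the estimator above; the instance type, batch size, and serialization choices are assumptions rather than our exact settings.

```python
# Rough sketch: deploy the trained model, score the test set, and clean up
import numpy as np
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# One call to deploy() stands up a hosted endpoint and returns a predictor
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer(),
)

def accuracy(x, y, batch_size=500):
    """Share of test rows whose predicted class matches the true label."""
    correct = 0
    for start in range(0, len(y), batch_size):
        batch = x[start:start + batch_size]
        result = predictor.predict(batch)
        preds = [p["predicted_label"] for p in result["predictions"]]
        correct += int(np.sum(np.array(preds) == y[start:start + batch_size]))
    return correct / len(y)

print(f"Test accuracy: {accuracy(features[test_i], labels[test_i]):.2%}")

# Tear down the endpoint so you are not billed for an idle resource
predictor.delete_endpoint()
```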
Final Thoughts and Best Practices
After using the application to build our model, here are some of our final thoughts:
- SageMaker helped to streamline the process and saved an immense amount of time, particularly in the training and deployment stages
- Make sure you select an instance type that best fits the need of your data. You can run into issues, such as the kernel frequently crashing, if your instance type does not support the size of your data and complexity of your chosen algorithm. For more information, consult the SageMaker documentation.
- Use the SageMaker documentation and supplementary resources to guide you through the process, as well as to help explain AWS-specific features you may not already be familiar with.
- Unlike some other machine learning platforms, SageMaker does require ample experience and knowledge in Python. If you are inexperienced or a novice in this area, this may not be the best platform for you.
- Always terminate resources you are finished using; a short sketch of doing this programmatically follows this list.
- Stop the notebook instance once you are finished working in it
- Terminate the endpoint when you are done using it
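If you prefer to clean up outside of the console, a couple of boto3 calls cover both items; the resource names below are placeholders for whatever you named yours.

```python
# Rough sketch of cleaning up with boto3 (resource names are placeholders)
import boto3

sm = boto3.client("sagemaker")

# Stop the notebook instance so it no longer accrues compute charges
sm.stop_notebook_instance(NotebookInstanceName="flight-delay-notebook")

# Delete the hosted endpoint created during deployment
sm.delete_endpoint(EndpointName="linear-learner-flight-delay")
```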
Once again, if you would like to follow along with us or see how we went about this process, click HERE to look at an HTML file of our notebook. If you are interested in learning more about AWS SageMaker, feel free to reach out to our team – we would love to talk to you!