
Amazon Aurora is a relational database engine designed to be compatible with MySQL and PostgreSQL. This allows you to move your existing applications to the cloud – at up to 5x the performance of the standard versions of these databases! Amazon Aurora supports very large databases without compromising performance. It’s a reliable and cost-efficient option for growing businesses.

Below you can explore the plethora of updates that came with RDS for Amazon Aurora in 2019!

 

Amazon Aurora Updates

Sharing and Cross-Region Copying of Snapshots

Aurora Serverless DB cluster snapshots can now be shared publicly or with other AWS accounts. They can also be copied across multiple AWS Regions.

Authorized AWS accounts can easily restore a DB cluster from the shared DB cluster snapshot without copying the snapshot.

This feature is also supported by Amazon Aurora with PostgreSQL.

Capacity of 1 Unit and a New Scaling Option

Instead of managing and provisioning DB instances, Aurora Serverless lets you set minimum (now as low as 1 ACU) and maximum capacity values for your DB cluster. An ACU (Aurora Capacity Unit) is a measure of capacity that combines processing and memory.
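
For reference, here is a minimal boto3 sketch of setting that capacity range on an existing cluster (the cluster identifier and the capacity values are assumptions, not values from the announcement):

```python
import boto3

rds = boto3.client("rds")

# Hypothetical cluster name; define the ACU range the cluster may scale within.
rds.modify_db_cluster(
    DBClusterIdentifier="my-serverless-cluster",
    ScalingConfiguration={
        "MinCapacity": 1,              # the new 1-ACU minimum
        "MaxCapacity": 8,
        "AutoPause": True,             # pause the cluster when idle
        "SecondsUntilAutoPause": 300,
    },
)
```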

Availability in 14 AWS Regions

Aurora Global Database is now available in 14 AWS Regions – see below.

EU (Ireland), EU (London), US East (Ohio), Asia Pacific (Seoul), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Canada (Central), EU (Frankfurt), Asia Pacific (Singapore), EU (Paris), Asia Pacific (Tokyo), US West (Northern California), and Asia Pacific (Sydney).

Replication can be set up between any two of the AWS Regions listed above.

 

Performance Insights Support

Performance Insights is a feature that monitors database performance by visualizing database load over time. This helps in spotting and troubleshooting performance issues.

This feature is available in Amazon Aurora.

RDS Recommendations

The RDS Recommendations feature in Amazon Aurora provides automated best practice guidance by evaluating your configuration, usage, and performance data. This includes information about configuration issues, such as mismatched instance classes or outdated engine versions.

 

Amazon Aurora with MySQL Updates

Performance Insights Counters

Performance Insights is an advanced performance monitoring tool for database workloads. Additional counter metrics are now supported for Amazon Aurora with MySQL compatibility. Counter metrics allow you to customize the Performance Insights dashboard with 10 additional graphs.

Amazon CloudWatch

Amazon Aurora Serverless MySQL-compatible databases can now be configured to publish general logs, slow query logs, audit logs, and error logs to Amazon CloudWatch. This will help you keep track of your Aurora Serverless databases’ query performance, database activity, and errors.
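
For Aurora Serverless, these logs are typically switched on through the DB cluster parameter group; a rough boto3 sketch (the parameter group name is an assumption) might look like this:

```python
import boto3

rds = boto3.client("rds")

# Hypothetical parameter group attached to the serverless cluster; once these MySQL
# logging parameters are enabled, the logs can be viewed in Amazon CloudWatch.
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-serverless-params",
    Parameters=[
        {"ParameterName": "general_log", "ParameterValue": "1", "ApplyMethod": "immediate"},
        {"ParameterName": "slow_query_log", "ParameterValue": "1", "ApplyMethod": "immediate"},
    ],
)
```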

Supports db.r5 and db.t3 DB Instance Classes

Amazon EC2 R5 instances are the next generation of memory optimized instances for the Amazon Elastic Compute Cloud.  R5 instances are well suited for memory intensive applications such as high performance databases, distributed web scale in-memory caches, mid-size in-memory databases, real time big data analytics, and other enterprise applications.

T3 instances are the next generation low cost burstable general-purpose instance type that provide a baseline level of CPU performance with the ability to burst CPU usage at any time for as long as required. T3 instances are designed for applications with moderate CPU usage that experience temporary spikes in use.

Amazon Aurora for MySQL now supports R5 and T3 instance classes.

GTID based replication

Amazon Aurora with MySQL 5.7 compatibility now supports GTID. GTID (Global Transaction Identifier) is a unique identifier that is assigned to each transaction of the MySQL database.

Faster Migration from MySQL 5.7 databases

MySQL 5.7 backups stored in S3 can now be restored in Amazon Aurora with MySQL compatibility. Both full and incremental backups of the database can be restored. This is much faster than the mysqldump utility.

Data API

The Data API is now available for Aurora with MySQL compatibility. Web service-based applications like AWS Lambda, AWS AppSync, and AWS Cloud9 can access the Aurora cluster through it. The Data API doesn’t require a persistent connection to the DB cluster; it provides a secure HTTP endpoint to run SQL statements.
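
Here is a minimal sketch of calling the Data API with boto3 (the ARNs, database name, and query are placeholders):

```python
import boto3

client = boto3.client("rds-data")

# The Data API authenticates with a Secrets Manager secret instead of holding a
# persistent database connection; both ARNs below are hypothetical.
response = client.execute_statement(
    resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster",
    secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:my-db-secret",
    database="mydb",
    sql="SELECT id, name FROM customers LIMIT 10",
)
print(response["records"])
```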

 

Amazon Aurora with PostgreSQL Updates

Logical Replication

Amazon Aurora with PostgreSQL compatibility now supports logical replication. Using logical replication, you can replicate data changes from one Amazon Aurora database to other databases with replication tools like AWS DMS or native PostgreSQL replication slots.

Activity Streams

Amazon Aurora with PostgreSQL now supports Activity Streams for real time monitoring. Database Activity Streams provide a real-time data stream of the database activity in your relational database.

Cluster cache management

Cluster cache management is now supported by Amazon Aurora with PostgreSQL compatibility. This feature provides a faster way to recover in case of a failover. With cluster cache management, a specific reader DB instance in the cluster is designated as the failover target. Data in the designated failover target’s cache is kept synchronized with the cache of the read-write instance. In case of a failover, the designated failover target is promoted to the read-write instance.

Data import from Amazon S3

Amazon Aurora with PostgreSQL compatibility allows you to import data from an Amazon S3 bucket into an Aurora PostgreSQL instance. A new extension, aws_s3, has been added to support the import of data. Any data format supported by the PostgreSQL COPY command is supported for import.
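
As a rough sketch of how the import is invoked (connection details, table, and bucket names are all assumptions), the aws_s3 extension is driven from SQL; here it is wrapped in Python with psycopg2:

```python
import psycopg2

# Hypothetical connection details; the cluster also needs an IAM role that allows S3 access.
conn = psycopg2.connect(
    host="my-aurora-pg.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="mydb", user="postgres", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;")
    cur.execute("""
        SELECT aws_s3.table_import_from_s3(
            'sales',                           -- target table
            '',                                -- column list ('' = all columns)
            '(format csv, header true)',       -- options accepted by COPY
            aws_commons.create_s3_uri('my-bucket', 'exports/sales.csv', 'us-east-1')
        );
    """)
```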

Copy tags from cluster to database snapshot

Tags on an Amazon Aurora with PostgreSQL compatibility database cluster can now be automatically copied to any cluster-created database snapshots. This allows you to easily set metadata on the snapshots to match the parent cluster, as well as to set access policies.

Restore an Encrypted Amazon Aurora PostgreSQL Database from an Unencrypted Snapshot

An encrypted Amazon Aurora with PostgreSQL compatibility cluster can now be restored from an unencrypted snapshot. To encrypt the new database, create a new key or choose any existing Amazon Key Management Service (KMS) key.
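
A boto3 sketch of such a restore (the identifiers and key alias are assumptions); supplying KmsKeyId is what encrypts the new cluster:

```python
import boto3

rds = boto3.client("rds")

# The source snapshot is unencrypted; passing a KMS key encrypts the restored cluster.
rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="my-encrypted-aurora-pg",
    SnapshotIdentifier="my-unencrypted-snapshot",
    Engine="aurora-postgresql",
    KmsKeyId="alias/my-aurora-key",   # or the ARN of any existing KMS key
)
```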

Cloning Across AWS Accounts

An Amazon Aurora DB cluster can now be shared with other AWS accounts. This provides a quick and efficient way of cloning a database across accounts. Database cloning does not require additional storage and is faster than restoring a snapshot.

Clearly, RDS for Amazon Aurora saw a lot of upgrades this year! Feel free to contact us if you want to get started with these services.


Amazon RDS (Relational Database Service) is one of the many web services offered by AWS. Amazon RDS makes the job of database administrators easier, helping them set up, operate, and scale databases and deployments. RDS is a great platform for DBAs who are looking to spend less time on traditional tasks like platform work, security policy provisioning, performance monitoring, and database handling. This frees up time to focus on developing applications.

RDS supports DB engines such as MySQL, MariaDB, PostgreSQL, SQL Server, and Oracle. If you currently have an Oracle database and are planning to move to the cloud, RDS will help you deploy within minutes. You won’t lose any of your capabilities or features and will gain flexible capacity at a low cost. You can also migrate any applications or tools that use Oracle databases. RDS is improving constantly, with the primary goal of making its platform easier for DBAs.

Without further ado, here are the new features added to RDS for Oracle in 2019.

Supports a new version of SQLTXPLAIN (SQLT)

SQLT is an Oracle tool that helps identify and troubleshoot poorly performing SQL statements. Oracle SQLT delivers a report that includes statistics, execution plans, optimizer details, and other information that can help tune SQL queries. RDS for Oracle now supports SQLTXPLAIN version 12.2.180725.

Supports db.z1d and db.t3 DB instance classes

z1d instances are the latest generation of instance classes optimized for memory-intensive applications. They offer both high compute capacity and a high memory footprint.

T3 instances are the next generation low cost burstable general-purpose instance type that provide a baseline level of CPU performance with the ability to burst CPU usage at any time for as long as required. T3 instances are designed for applications with moderate CPU usage that experience temporary spikes in use.

Oracle RDS now supports the z1d and T3 instance families.

Supports Oracle APEX versions 19.1.v1, 18.1.v1, and 18.2.v1

APEX (Application Express) is a low-code, browser-based development platform that helps developers build applications that can be deployed anywhere. Versions 11.1, 12.1, and 12.2 of Oracle on RDS now support Oracle APEX versions 19.1, 18.1, and 18.2.

Supports Amazon S3 integration

Oracle RDS now supports integration with S3 for data ingress and egress. This integration supports easy, efficient, and secure data transfer between Oracle RDS and S3. Before this, transferring data from S3 to Oracle RDS required an additional DB instance and a network link between the two instances.

Supports Read Replicas

Oracle RDS now supports read replicas with Active Data Guard. Oracle RDS makes it easy to create replicas within the same region. The primary DB instance manages the Active Data Guard configuration and maintains secure communication with the replica instances.

Supports RMAN DBA tasks

Oracle RDS supports Oracle Recovery Manager (RMAN) DBA tasks. Using the Amazon RDS package rdsadmin.rdsadmin_rman_util, database backups can be performed to disk. The rdsadmin.rdsadmin_rman_util package supports full and incremental database file backups, tablespace backups, and archive log backups.

Support for 64 TiB of storage and 80k IOPS

The maximum storage for RDS for Oracle has now been increased to 64 TiB. Oracle RDS also supports provisioned IOPS performance of up to 80,000 IOPS. Existing databases can be scaled up to 64 TiB without any downtime. With this increased limit, you can consolidate your database shards into a single database instance.

Amazon RDS Performance Insights Counters

Performance Insights is an advanced performance monitoring tool for database workloads. Additional counter metrics are now supported for Oracle RDS. Counter metrics allow you to customize the Performance Insights dashboard with 10 additional graphs.

Amazon RDS Performance Insights Supports Viewing More SQL Text

Performance Insights for Oracle RDS now shows up to 1,024 bytes of SQL text for each row in the Top Load Items list. The earlier limit was 500 bytes.


AWS SageMaker is a machine learning platform for data scientists to build, train, and deploy predictive ML models. Unlike the traditional machine learning process, SageMaker allows data scientists to hop into the driver’s seat on projects and complete all three steps independently.

In an attempt to show you how to use SageMaker yourself, we decided to use the platform to create a machine learning model of our own. Using publicly available airline data, our goal was to create a model that would predict whether or not a flight would be delayed. For this blog post, we are going to walk you through from start to finish how we created our machine learning model using SageMaker. If you would like to follow along with us, click HERE to access an HTML copy of our notebook. Let’s get started!

The Set Up

To start creating resources, we first went to the AWS website and logged into the console. Before we use SageMaker, we need to ensure that we either have an S3 bucket ready to use or create one for our notebook instance to work with. Whatever S3 bucket you decide to use, it MUST be in the same region as your notebook instance.

Once we have an S3 bucket ready to use, we can now head over to SageMaker within the AWS console to set up our notebook. When we arrive at the SageMaker dashboard, click on the button that reads “Notebook Instances.” This will direct you to where you can either access a pre-existing notebook, or make a new one. To make a new resource, we are going to follow this procedure:

  1. Select “Create notebook instance”
  2. Give your notebook a name
  3. Select the instance type
    1. You need to ensure that your instance is big enough to support your workload. Typically, we will use an ml.t2.medium instance, which has 4 GiB of memory, 2 vCPU, and low network performance. For our data set, we used an ml.m5.xlarge instance which has 16 GiB of memory, 4 vCPU, and high network performance.
  4. If you would like to add an Elastic Inference, select which type you would like to add here; note that this is not required
  5. Call your IAM role
    1. Select “Create a new role”
    2. Select “Specific bucket” → type in the name of the specific S3 bucket you would like to call
  6. Select “Network”
    1. Select “VPC” and select the default option from the drop down menu
    2. Select a subnet
    3. Select the default security group from the drop down menu
  7. Select “Create notebook instance” at the bottom of the page

From here you will be directed to the Jupyter Notebook server:

  1. If you need to upload your data from your drive, select “Upload” in the top right corner, and then select the file you wish to upload into the server
  2. To create your notebook, select “New” in the top right hand corner, and then select your preferred notebook type from the drop down menu. SageMaker supports both Python and R, and numerous environments. For our example, we decided to use “conda_python3.” From here you will be directed into your notebook resource where you can begin creating your model.

Step One: Preparing the Data

Now that our notebook is created and ready to use, we can begin building our model. Before we get started, let us note that we have decided to use the “linear learner” algorithm, which will dictate how we approach certain steps. If you decide to use a different algorithm, check out the SageMaker documentation for what needs to be done. Here is how we went through the building step using SageMaker:

  1. Import all of the necessary libraries and modules; to see which ones we used, check out the HTML file that contains all of our code.
  2. Read your csv as a pandas data frame
  3. Clean the data
    1. All data cleaning is subjective and up to you
  4. Format the data according to the model’s requirements. In the case of the linear learner, we must do the following:
    1. The data must be transformed according to how it is being fed to the training algorithm
      1. This information can be found in the SageMaker documentation
      2. For the linear learner, we transformed the data into matrices
    2. All variables need to be integers
    3. In the case of a binary classifier, all Booleans must be expressed as 0 and 1. In the case of a multi linear classifier, the label must be between 0 and the number of classes minus 1.
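
To illustrate step 4, here is a rough sketch of the kind of formatting the linear learner expects; the column names and the delay threshold are hypothetical rather than the exact ones from our notebook:

```python
import pandas as pd

# Hypothetical columns; the goal is float32 feature matrices and 0/1 labels,
# which is what the linear learner's binary classifier expects.
df = pd.read_csv("flights.csv")

df["delayed"] = (df["arrival_delay"] > 15).astype(int)      # Boolean label expressed as 0/1
feature_cols = ["distance", "departure_hour", "day_of_week"]

X = df[feature_cols].to_numpy(dtype="float32")              # feature matrix
y = df["delayed"].to_numpy(dtype="float32")                 # label vector
```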

Step Two: Training the Model

  1. First we need to decide what algorithm we are going to use. For our model, we used linear learner. The remaining list of readily available algorithms is provided in the SageMaker documentation.
  2. Split up the data into training, testing, and validation data. We split the data in the following proportions:
    1. 80% used for training
    2. 10% used for testing
    3. 10% used for validation
    4. The splitting of the data is subjective and depends on the use case and other factors
  3. Create three separate channels for the training, testing, and validation data respectively
  4. Start a training instance using your algorithm of choice. This creates a container with the algorithm, which in our case was linear learner, implemented and ready to use.
  5. Run the training job
    1. This will take a bit of time; our training job took approximately 1600 seconds, which equates to about 27 minutes. The length of the training job largely depends on the size of your data and your algorithm.
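
For reference, here is roughly what steps 3 through 5 look like with the SageMaker Python SDK’s built-in linear learner estimator. This is a sketch using v1-style SDK calls; the instance type and the split arrays from step 2 (X_train, y_train, X_val, y_val) are assumptions:

```python
import sagemaker

role = sagemaker.get_execution_role()

# Built-in linear learner estimator; training runs in a managed container.
linear = sagemaker.LinearLearner(
    role=role,
    train_instance_count=1,
    train_instance_type="ml.m5.xlarge",
    predictor_type="binary_classifier",
)

# Wrap the numpy arrays as RecordSets, one per channel.
train_records = linear.record_set(X_train, labels=y_train, channel="train")
val_records = linear.record_set(X_val, labels=y_val, channel="validation")

linear.fit([train_records, val_records])   # launches the training job
```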

Once the training job is complete, you are ready to deploy and evaluate your model to see how well it performs.

Step Three: Deploying and Evaluating the Model

  1. Using one line of code, deploy the model to return an endpoint so the model can now be used to make predictions.
  2. Once the model is deployed, create a function and call it to evaluate the accuracy of the model
    1. We found that our model runs with 76.72% accuracy, which is decent.
    2. We found this metric simply by taking the number of successful predictions out of the total number of predictions made using the test data. Therefore, we can say that our model accurately predicted if there was a flight delay 76.72% of the time.
  3. Make sure you terminate the endpoint when you are done using it so you are not billed for resources not being used.
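
Continuing the sketch above, deployment, a simple accuracy check, and cleanup come down to a few calls (the endpoint instance type is an assumption, and a large test set may need to be sent to the endpoint in smaller batches):

```python
# Deploy the trained estimator behind a real-time endpoint.
predictor = linear.deploy(initial_instance_count=1, instance_type="ml.t2.medium")

# Score the held-out test records and compute simple accuracy.
results = predictor.predict(X_test)
predicted = [r.label["predicted_label"].float32_tensor.values[0] for r in results]
accuracy = sum(int(p == a) for p, a in zip(predicted, y_test)) / len(y_test)
print(f"accuracy: {accuracy:.2%}")

# Tear the endpoint down so it stops billing.
predictor.delete_endpoint()
```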

Final Thoughts and Best Practices

After using the application to build our model, here are some of our final thoughts:

  • SageMaker helped to streamline the process and saved an immense amount of time, particularly in the training and deployment stages
  • Make sure you select an instance type that best fits the need of your data. You can run into issues, such as the kernel frequently crashing, if your instance type does not support the size of your data and complexity of your chosen algorithm. For more information, consult the  SageMaker documentation.
  • Use SageMaker documentation and supplementary resources to guide you through the process, as well as help provide explanations for AWS specific features you may not already be familiar with.
  • Unlike some other machine learning platforms, SageMaker does require ample experience and knowledge in Python. If you are inexperienced or novice in this area, this may not be the best platform for you
  • Always terminate resources you are finished using.
    • Stop running the notebook once you are finished working on it
    • Terminate the endpoint when you are done using it

Once again, if you would like to follow along with us or see how we went about this process, click HERE to look at an HTML file of our notebook. If you are interested in learning more about AWS SageMaker, feel free to reach out to our team – we would love to talk to you!


Welcome back to our fourth and final installment of our series! In this post, we will be discussing single and multiple linear regression, so let’s jump right into it.

Making Predictions Using Single Linear Regression

Linear regression is a statistical modeling tool that we can use to predict one variable using another. This is a particularly useful tool for predictive modeling and forecasting, providing excellent insight on present data and predicting data in the future. The goal of linear regression is to create a line of best fit that can predict the dependent variable with an independent variable while minimizing the squared error. That was a pretty technical explanation, so let’s simplify. We are trying to find a line that we can use to predict one variable by using another, all while minimizing error in this prediction. In order for this line to be helpful, we need to find the equation of the line, which is as follows:

ŷ = β0 + β1x

    • ŷ → the predicted dependent variable
    • β1 → the slope of the line
    • x → the independent variable aka the variable we are using to predict y
    • β0 → the intercept of the line

This equation may look familiar, and it should. It is the same equation as your standard y= mx + b that you learned back in Algebra I, just written a little differently in statistics language. 

Some of these concepts are difficult to understand on their own, so let’s apply them to our example housing data set to see them in action.

When we calculated correlation, we found a pretty strong correlation between the size of the living room and the price. Now that we know that there is something going on between those variables, we want to see if the living room size can be used to predict the price of our house. Before we jump into calculations, we first want to make sure that we can actually use the linear regression methodology; this means we need to check to see if the relationship between our living room size and price is approximately linear. 

On the left we have all of our data points scattered, and on the right we have the same graph except the line of best fit that we are trying to find is included as well:

When we scatter our data, we find that the relationship between the living room size and the price is approximately linear, meaning we now have the thumbs up to begin our linear regression. 

Testing single linear regression can be done by hand, but it is much easier and quicker to use tools like Excel or Jupyter Notebook to create predictive models. In order to conduct a regression analysis in Excel, you need to make sure that your Analysis ToolPak is activated. How to activate the ToolPak will depend on what version of Excel you are running, so I would recommend just doing a quick Google search on how to set it up. It is fairly simple, so this shouldn’t take any longer than a minute or two. Once it is activated, you are good to go!
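
If you would rather work in Jupyter Notebook, here is a minimal equivalent using statsmodels; the file name is a placeholder, and the price and sqft_living column names follow the housing data set used throughout this series:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("housing.csv")

# Regress price on living room size; add_constant supplies the intercept term.
X = sm.add_constant(df["sqft_living"])
model = sm.OLS(df["price"], X).fit()

print(model.summary())      # R-squared, coefficients, and p-values
residuals = model.resid     # residuals, used for the residual plots later on
```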

Here is a brief outline of how to conduct your regression analysis using Excel: 

  1. Select “Data” tab → Select “Data Analysis” → Select “Regression”
  2. Input Y Range → select the data for the variable you are trying to predict; dependent variable 
    1. With our example: Price
  3. Input X Range → select the data for the variable you are using to predict the other variable; independent variable
    1. With our example: Living room size 
  4. Select the “Labels” box so that your output sheet will include the corresponding data labels
  5. If you want your regression output to be inserted onto a new worksheet page, check the box for “New Worksheet Ply”
  6. Check the boxes that say “Residuals” and “Residual Plots;” this will be important later on in our analysis and conclusions

When you press enter, you will be directed to a new workbook sheet that will have a bunch of output tables that may seem overwhelming and confusing, but don’t worry! We are going to show you which numbers are actually important for you to look at. 

We have highlighted all of the boxes / numbers that are important to look at. Let’s walk through these individually.

  1. R Square → the r square value tells us how good our model is at predicting the dependent variable. The closer the number is to 1, the better our model is. With this output, we see our r square value is 0.4949, which means that 49.49% of our data can be explained by our model. 
  2. Intercept Coefficient → the intercept coefficient value tells us our β0 value, which is the intercept of the regression line; this tells us that when the square footage of the living room is zero, the price of the house is -$43580.74
    1. NOTE: The intercept is not always applicable, which is true in our instance since a negative house price is impossible
  3. Sqft_living Coefficient → this gives us the slope of our line, which is β1; the slope tells us that as the square footage of the living room increases by one square foot, the price of the home increases by $280.62.
  4. Sqft_living p-value → this will tell us if the predictor in question, which in our case is the living room size, is a good predictor of our dependent variable. In this instance, with a p-value of approximately zero, we can conclude that the living room size is a good predictor of housing price. 
    1. Note: You can also look at the Significance F value, but if your predictor is statistically significant, the model will be too and vice versa

Now let’s put it all together to create the equation of our line:

Predicted Price = 280.62(living room size) – 43580.74

Now, ideally, given a living room size, we can make a prediction of the price of any given house. 
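
For example, plugging a hypothetical 2,000-square-foot living room into this equation gives a predicted price of 280.62(2000) – 43580.74 = $517,659.26.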

Now that we have built our model, it is important to then look at our residuals, otherwise known as errors. The residual / error, typically denoted as “e,” is essentially how far off our prediction was from the actual price. This is given by the following equation: 

e = actual – predicted

If you want to see this written in statistics language, keep reading; if not, go ahead and jump to the next section. 

e = y – ŷ 

Visually, if the data were to be scattered and the line of best fit laid on top, then the error values would be the distance from the actual observed value to the line at that specific x value, which would be the predicted y value. Here is a visual to help conceptualize this: 

When our error is negative, that means that we overestimated the price; when our error is positive, that means we underestimated the price. When our error is zero, that means our predicted price was accurate with the actual housing price given. The smaller the error, the more accurate our prediction is. 

When looking at error, we want to ensure that we are equally overestimating and underestimating our data, as well as that errors are occurring randomly. This is why, when we are constructing our data analysis, we want to include residuals and residual plots. How do we make sure these assumptions are being upheld? Excel will calculate all of our errors and then scatter them on a graph with the x axis being the independent variable and the y axis being the error value. We will look for the following characteristics on the graph to ensure these assumptions are upheld:

  1. The data points are scattered evenly around y = 0
    1. Equally over and underestimating the dependent variable
  2. The residual plot follows no pattern
    1. Errors are occurring randomly 
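
If you are following along in Jupyter rather than Excel, the same kind of residual plot can be drawn from the statsmodels fit sketched earlier (again, the column name is from our housing data set):

```python
import matplotlib.pyplot as plt

# Residuals from the earlier OLS fit, plotted against the predictor with a zero reference line.
plt.scatter(df["sqft_living"], residuals, s=5)
plt.axhline(0, color="red")
plt.xlabel("Living room size (sqft)")
plt.ylabel("Residual (actual - predicted)")
plt.show()
```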

Here is an example of a residual plot that upholds these assumptions relatively well:

Here is the residual plot for our single linear regression analysis with living room size: 

Looking at the plot, we can see the residuals are approximately scattered around zero, meaning we are equally over and under estimating the price. However, we do notice a pattern, meaning the errors are not necessarily occurring randomly. Looking at the graph, we notice a horn shaped pattern, meaning that as the living room size increases, so does the error. This indicates a heteroscedasticity problem, which is a much simpler concept than the word is to pronounce. All this means is that the residuals are occurring unevenly, resulting in the pattern we see above. In order to uphold the regression assumptions, we would want to see a pattern of homoscedasticity, meaning that the residuals are occurring evenly throughout. This breach of the assumption is very important to note, because it means that the living room size may not be a good candidate for predicting house price. This means we should look more into other predictors and find a better candidate. Let’s take a look at the basement size.

Here we see that the p-value for basement size as a predictor and the F value for the model are both zero, meaning that the model is a good predictor of price. Moving to the r square value, we find that it is less than that of the living room model; this means that the model with the basement size explains much less of the data than the one with the living room size. However, the issue that we came across with living room size lay in the residual plot, so let’s take a look at that.

We first notice that the residuals are evenly scattered around zero. Then we notice that, unlike the living room size residuals, there is no apparent pattern in the plot. This means that there is homoscedasticity, and the basement size is therefore a better candidate for predicting price than the living room size is.

Once again, this process is easily repeatable due to tools such as Excel. We have gone ahead and tested the remaining viable predictors and created a comprehensive table of our findings: 

PREDICTOR | STATISTICALLY SIGNIFICANT? (Y/N) | RESIDUAL PLOT
Living Room Size (in sqft) | Yes | Heteroscedastic
Living Room Size 2015 | Yes | Homoscedastic
Basement (in sqft) | Yes | Homoscedastic
Lot Size (in sqft) | Yes | Heteroscedastic
Lot Size in 2015 | Yes | Heteroscedastic
Square footage of the house above ground | Yes | Heteroscedastic

 

Of the single predictor models, we can see that the basement model and the living room size in 2015 model uphold the assumptions the best out of all of the other predictors. However, it should be noted that both r square values were modest, so we need to see if there is a better model out there. We know that one variable likely is not able to be the sole predictor of housing price, so what if we combine multiple predictors into one model? That is where multiple regression comes in.

More Predictive Modeling with Multiple Linear Regression

We are going to build off of single linear regression, so if you are still confused about that, it may be beneficial to go back through and review the previous section until you feel comfortable with the concepts. If you are good to go, then let’s venture on to multiple linear regression!

Multiple linear regression is very similar to single linear regression except it is a little bit more complex. Instead of looking at just one predictor, we are now going to incorporate multiple predictors into our model to see how good they are at predicting price together. So with our example, instead of looking at just one predictor of housing price at a time, we can create one model that incorporates all of our different predictors to see if we can better approximate the price.

One of the assumptions for multiple linear regression, just like single linear regression, is that the multivariate regression line is approximately linear. However, unlike our single linear regression line, it is near impossible to confirm whether or not we uphold this assumption since our line cannot be visualized on a 2-D plane. This does not pose a serious obstacle since deviation from the assumption does not impact our results in a major way. However, it is a good rule of thumb to look at each variable’s linearity individually just to ensure that there is nothing that poses a serious threat to the assumption. When looking at the variables we will be exploring, all of them follow an approximately linear relationship and therefore should uphold our linearity assumption for the multivariate model.

Once again we will be using the Analysis ToolPak in Excel, so make sure that is good to go. Head to the Data tab and select Data Analysis and then Regression, just like we did with the single linear regression. The only thing that we are going to change here is our “X Input Range” values. Instead of inputting just one predictor, we are going to input all of our different predictors at the same time; so in this instance: the living room size, the lot size, and the square footage of the house above ground level. Make sure you select the “label,” “residual,” and “residual plot” boxes as well. Press enter, and you are good to go.

Since multiple linear regression is more complex than single linear regression, it is fitting that there are a few more numbers that we will have to look at when doing our analysis. We went ahead and highlighted them for you: 

Once again, let’s go through each of these numbers individually:

  1. Adjusted R Square → As you increase the number of predictors in your model, your r square value will artificially inflate. That is why there is an adjusted r square value that accounts for the increase in the number of predictors to give a deflated and more accurate figure.
  2. Regression Significance F → This is like a p-value but for our entire model; it will tell us whether or not our model is a good fit for predicting our dependent variable, which in this case is price.
  3. Intercept Coefficient → Just like in single linear regression, this will tell us the intercept for our model. 
  4. Predictor Coefficients → The remaining coefficients for our different predictors will give us the slopes for each independent variable respectively.
  5. P-value → This tells us whether or not the variable is good at predicting price given that the other variables are already included in the model. 
    1. Ex: Living room size is a good predictor of price when lot size and the square footage of the house above ground are already included in the model 

Now we can put these together to create our equation: 

Price = 269.15(living room size) – 0.2782(lot size) – 16.90(sqft above ground) – 41445.124

Now looking at “Significance F,” we see that it is approximately zero. What does this mean? For consistency, let’s keep our significance level, or cut off value, at 0.05. Since our F value is less than our cut off, just like a p-value, we can say that our model is a good fit for our data and for predicting our dependent variable. This is a great tool for comparing different models. For example, let’s say we created another model that had an F value greater than 0.05, then we know it is not a good fit for the data, and therefore we should go with our other model since it statistically proves to be a better fit. 

But what if there are multiple models that prove to be significant? How do you pick which one is best? This is where it gets to be a bit more subjective and flexible. To do this, you will have to examine and compare your adjusted r square values. The higher your adjusted r square value is, the better your model is at explaining the data and therefore is a better model for predicting our dependent variable. However, if we increase our predictors and we only see a slight increase in the r square value, it technically is a “better model,” but we can conclude that the additional predictors are not really contributing to the model in a significant way since they don’t help explain that much more of the data. 

So let’s say that you drop all variables except for one and you find that your r square value is similar to that of the model that includes all of the possible variables, which will be referred to from here on out as the “extended model.” What does this result mean? This means that with one predictor, you are able to explain approximately the same amount of data as you can with the extended model. And what does that mean? That means that your other variables are mostly just deadweight, not really contributing anything meaningful to your model. Your model with just one variable is able to do just as good of a job as the model with multiple variables. You can then make the model more simplistic and concise by dropping variables that don’t significantly impact the model. This is what makes F testing flexible; you are able to play around with a multitude of different models and see for yourself which variables are significantly impacting your dependent variables and which aren’t. 

We played around with some different models and put them into a comprehensive table for you below:

PREDICTOR(S) | SIGNIFICANCE | ADJUSTED R SQUARE
Living room size, lot size, sqft of above ground | Significant | 0.4942
Living room size and lot size | Significant | 0.4938
Living room size and sqft of above ground | Significant | 0.4932
Lot size and sqft of above ground | Significant | 0.3671
Living room size | Significant | 0.4928
Lot size | Significant | 0.00799
Sqft of above ground | Significant | 0.3667

 

This is a lot of data to digest, so let’s break it down so it is a bit easier to comprehend. We don’t need to bother looking at the significance since all of the tested models proved to be statistically significant. Does that mean that they are all good models? Technically yes, but we can pick the best model by looking at our adjusted r square number. We see that the highest r square value is the model that includes all of our variables; note that from here on out we will be referring to this as the “extended model.” Does that mean the extended model is the best model? Once again, technically yes, but let’s dig a little deeper. We see that our model that only includes the living room size has an r square value that is within hundredths of the extended model. This means that the model with only living room size can explain the same amount of data as the extended model. What can we do with this information? We can conclude that the lot size and square footage of the house above ground level are not really contributing that much since they are not really helping to explain more of the data, so we can drop them from our model and get almost identical results. It can’t hurt per se to use our extended model, but using just the one variable makes our model more concise and easier to work with, leaving less room for errors and complications. 

How do we know when adding a predictor to our model is useful? Let’s compare our model that includes only the lot size to our model that includes both lot size and living room size. When we add the living room size to the lot size model, our new model is now able to explain 48.581% more data. That’s a lot! This shows us that adding the living room size to our lot size model is extremely valuable since our r square value, and thus ability to explain the data with our model, has increased drastically. We also see our model be able to explain 12.71% more data when we add living room size to the lot size and square footage of the house above ground model, which is modest but still nonetheless impactful and noteworthy.  

Some Parting Thoughts and Comments

Thanks for sticking with us through the series! We hope that you have taken something away from these posts that can help you to better look at and analyze your data. We also hope that we have demonstrated to you that you don’t need to be a data scientist or a statistics expert to turn your data into insight. 

For this blog series we ran quite a few numbers and tests in both Excel and Jupyter Notebook, so once again, if you are interested in seeing exactly what went into the calculations and graph / figure creation, follow the links below to be directed to all of the work we did with our data. Within the Excel worksheet you will find lots of data and lots of tests. The first worksheet includes all of the data as well as some relevant information about the set and decoder keys. All subsequent pages are various tests that were conducted and reported throughout this series. Within the Jupyter Notebook file you will also see the data uploaded from a csv file into the worksheet. Python was then used to create the sample size demonstration and all of the graphs / figures you see throughout. In addition, it demonstrates some alternative methods for calculating correlations, line parameters, and errors that we used to check our numbers against the Excel calculations. These functions that we have defined make calculations and testing highly repeatable; we would recommend using them with your own data set if you are interested in using Jupyter Notebook instead. Do note, however, that the results on both Excel and Jupyter Notebook are identical, and neither is more accurate or beneficial than the other. This is simply a way to show how you can perform these analyses on either platform depending on your skillset and knowledge.

Link HERE for Excel workbook 

Link HERE for Jupyter Notebook file 

Part 1 of Series: Click here

Part 2 of Series: Click here

Part 3 of Series: Click here


Microsoft Excel Guide

  • =AVERAGE(array)
    • Takes the average of the array
  • =STDEV.S(array)
    • Standard deviation of the sample  
  • =CORREL(array1, array2)
    • Finds the correlation between data array 1 and data array 2 
  • =Z.DIST(z*, 1)
    • Finds the area to the LEFT of the z* value
    • Use “Z” when you are working with a population
  • =Z.INV(probability)
    • Finds the z* value for the corresponding probability value
  • =T.DIST(t*, df, 1)
    • Finds the area to the LEFT of the t* value 
    • Use “T” when you are working with a sample 
  • =T.INV(probability, df)
    • Finds the t* value for corresponding probability value 

A Note About the T.INV and Z.INV Functions 

We used these functions to find t* and z* multipliers for confidence levels / intervals. When we conducted a confidence interval at 95% confidence, we did not input 0.95 into the T.INV or Z.INV function, but rather 0.975. Here is an explanation as to how to find what value to input into these functions: 

The figure below exhibits the task at hand when finding a t* or z* value at a certain confidence level. The blue area represents our confidence level, which is also a probability value. What we want to do is find the t* or z* value that corresponds with this probability value, as represented by the red circle. 

In this example, we are finding a 95% confidence level, and therefore are looking at a probability value of 0.95. When using the T.INV or Z.INV function, we need to use ALL of the area to the left of the t* or z* value that we are trying to find; that means that we need to find the remaining area under the curve.

We know that the area under the curve must total 1, so therefore the remaining area must equal 0.05 (1 – 0.95). We also know that since we are working with a normal distribution, the curve is perfectly symmetrical. This means that the remaining areas on either tail of the curve are equal. If the remaining two sections must total 0.05 and there are two tails, we know that each area must be 0.025 (0.05 / 2 = 0.025). Now we know that the left tail of the curve must be added to the 0.95 to find the cumulative area to the left of our multiplier, leaving us with 0.975 (0.95 + 0.025). The 0.975 value will be what we input into our Excel function to find the multiplier value.

As you conduct these tests more frequently, you will naturally memorize what value to input into your functions at different confidence levels. However, it may be difficult at first, so we recommend drawing out the curve, the area, and the multiplier you are finding to help visualize exactly what is being calculated and needs to be inputted. 

 


In our introductory blog piece, we discussed some high level statistical concepts which you can read HERE if you haven’t already or need a quick refresher. Now, we can finally start to dive more into the data and start running some tests. 

Correlation: A First Clue for a Relationship

A good place to start data analysis is by looking at correlation. Correlation, typically denoted as r in statistics language, is the relationship between the independent variable, x, and the dependent variable, y, and lies in between -1 and 1. Put even more simply, correlation tells us if there is a relationship occurring between two variables.

This is an important statistical concept, because it sets the stage for further investigation with our data. Finding the correlation between two values is our first clue that a relationship either exists, or does not exist, between the two variables. Using our knowledge about the strength of the relationship, we can then proceed to find out how good one is at predicting the other using linear regression methods. Essentially, correlation is the groundwork that allows us to determine which relationships are worth further investigation. Our correlation value can provide valuable insight on the strength and the direction of the relationship. 

Below is a comprehensive table that you can use to decode your correlation value: 

CHARACTERISTIC | INTERPRETATION
Positive correlation value | Positive direction / association; as one variable increases, so does the other
Negative correlation value | Negative direction / association; as one variable increases, the other decreases
|r| closer to 1 | The closer |r| is to one, the stronger the association
|r| closer to 0 | The closer |r| is to zero, the weaker the association

Finding our correlation value, and many other statistical tools, is made easy in Excel. The function you will use to find r is: =CORREL(array1, array2). Array 1 will simply be all of the data for your first variable, and array 2 is all of the data for your second variable; the order doesn’t matter. For example, if I was finding the correlation between GPA and SAT scores, I would first select all of my GPA data points, then separate them from the second array with a comma, and then select all of my SAT data points. Then you press enter and voila! You have your correlation value.

Going back to our example data set, we have created a correlation table that shows the correlation between all of our different variables. A heat map has been included on top to help visually show which have strong or weak associations and which have positive or negative associations. The more saturated the coloring of the box, the stronger the association. If the box is more red, the association is positive, and if the box is more blue, the association is negative. For example, a dark red box would indicate a strong, positive correlation, or a lighter blue would indicate a weak, negative correlation. 
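
If you prefer Jupyter Notebook, a correlation table and heat map like the one described can be produced in a few lines; the file name is a placeholder, and the red/blue color map mirrors the coloring described above:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("housing.csv")

corr = df.corr()   # pairwise correlations between all numeric columns
sns.heatmap(corr, cmap="RdBu_r", vmin=-1, vmax=1, annot=True, fmt=".2f")
plt.show()
```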

Perhaps the most important thing to take away from this section is that correlation does NOT imply causation. Two variables may be correlated, but that does not mean that one causes the other; we cannot emphasize this enough. There may be other confounding variables, which are variables that are not accounted for, that may be the cause of the phenomena that we see. For example, let’s say you find there is a correlation between water consumption and the frequency of illness. It would be absolutely ludicrous to draw the conclusion that drinking water causes you to be ill! Instead, we would use this finding of correlation to indicate that there is likely something else going on, such as the presence of cholera. While there is a correlation between the water consumption and illness, we know that the consumption of water isn’t truly what is causing the illnesses; it is the infectious bacteria contaminating the water that is causing people to become ill. This is why it is important to discern the difference between correlation and causation so you do not inaccurately report findings. Statistics lays the groundwork for your analysis, but proper research is necessary to really get to the true root of the phenomena.

AB Hypothesis Testing: A Statistical Experiment

Think of the science experiments that you used to do back in school. You pose a question that you are trying to answer; for example, how does sunlight affect plant growth? Then, you make an educated guess, or a hypothesis, as to what you think the results would be, and test whether or not your guess is correct. Statistical hypothesis testing is exactly the same; essentially, you think that there is some kind of relationship occurring between two variables, so you run a statistical experiment to see if your guess, or hypothesis, is correct. 

This process begins by looking at the correlation value, which will indicate whether or not there are variables worth investigating. For example, there was a high correlation between the grade, the size of the living room, and the square footage of the house above ground. Since the square footage of the house above ground and the size of the living room can take on an infinite number of values, that makes them continuous variables; this means that it would be advantageous to use linear regression to look deeper into their relationship with price, and even investigate if they can be used to predict it. However, the housing grade is a discrete variable, since there are a finite number of values that this variable can take. You cannot receive a housing grade of 2.5, only 2 or 3. Since grade is a discrete variable, it cannot be used in a linear regression model and would be tested best with the AB hypothesis testing method. 

From here, we are going to make a claim that we are trying to prove with the data; this is going to be the alternative hypothesis. However, a null hypothesis also needs to be set, which is a statement that essentially says that our inquiry was wrong, and the data does not support the claim made by the alternative hypothesis. However, the claim cannot be verified by just looking at the raw data; we will need a metric. The metric that is used to numerically show statistical significance is the p-value. If you are a statistics junkie, read our “For Statistics Junkies” section to learn more about the concepts behind the p-value and how it shows us significance; if not, you can go ahead and skip it.

For Statistics Junkies : The “p” in p-value stands for probability, which is what we are actually trying to find during a hypothesis test. What is that probability exactly? It is the probability that we see the value we observe in our data, given that the null hypothesis is true. Put more simply, we have found some observed value in our data set, so what is the probability that we see that observation if the null hypothesis were true. If the p-value is small, that means that the probability of seeing that observation is low if the null hypothesis were actually true. If the p-value is large, that means that the probability of seeing that observation is high if the null hypothesis were actually true. When we say we are doing a test under the assumptions of the null hypothesis, all we are saying is that the test is conducted under the assumption that the null hypothesis is true, which really only affects how we interpret our p-value. 

If it is unlikely that we see an observation assuming the null hypothesis to be true, we can conclude that the null hypothesis is false, and there is sufficient evidence in the data to reject it. However, if it is likely that we do see an observation assuming the null is true, we would fail to reject the null hypothesis. Notice how we said fail to reject and not simply accept it. When we fail to reject the null hypothesis, we aren’t saying the null hypothesis is true; we are only saying that there isn’t enough evidence in the data to support the alternative hypothesis claim. 

Since finding a probability means finding an area under the curve, the figure below provides a visual of the area we are trying to find for our example hypothesis test with housing price.

To conduct the hypothesis testing on housing grade, the data was first sectioned off so that group one contained houses grades 7 through 13, and then group two contained houses grades 1 through 6. Note that on the grading scale, a grade of 1 is the lowest and a grade of 13 is the highest. Before conducting any portion of the hypothesis test, it is important to first establish a null and alternative hypothesis:

  • Null Hypothesis: There is no statistically significant difference between the price of homes with a grade between 7 and 13 and the price of homes with a grade between 1 and 6. 
  • Alternative Hypothesis: The price of homes with a grade between 7 and 13 is statistically significantly greater than the price of homes with a grade between 1 and 6.

Statistically, this is how to express the null and alternative hypothesis:

  • Null Hypothesis: μ1 – μ2 = 0 (the mean price of houses graded 7-13 equals the mean price of houses graded 1-6)
  • Alternative Hypothesis: μ1 – μ2 > 0 (the mean price of houses graded 7-13 is greater than the mean price of houses graded 1-6)

When translating statistics language, the letters being used to represent variables actually reveal a lot of information. When a Greek letter is used, it represents a population parameter. When a Latin letter is used, which is just the alphabet used for modern English, it represents a statistic from the sample. For example, “σ” represents the population standard deviation whereas “s” represents the standard deviation of the sample.

Here is a brief overview of the process of hypothesis testing. Following this portion, this process is applied to the housing data to investigate if houses graded 7 through 13 are priced higher than houses graded 1 through 6. 

  1. Find the test statistic with the following formula:

     t* = (x̄1 – x̄2) / √(s1²/n1 + s2²/n2)

      1. x̄1 = the mean price of houses with grades 7-13
      2. x̄2 = the mean price of houses with grades 1-6
      3. s1 = the standard deviation of the price of houses with grades 7-13
      4. s2 = the standard deviation of the price of the houses with grades 1-6
      5. n1, n2 = the sample size for each group respectively
      6. NOTE: In Excel, =STDEV.S must be used when finding s1 and s2 since the standard deviation is being calculated from a sample, not a population
  2. Calculate p-value with the following formula
    1. p-value = 1- T.DIST(t*, df, 1)
      1. This T.DIST function will take a test statistic and find the corresponding area under the curve, or in other words the p-value
      2. df = degrees of freedom = n-2 where n is the sample size for the entire data set
      3. NOTE: You must do 1-T.DIST(t*, df, 1) because Excel functions take the area to the LEFT. Since the hypothesis test is one sided in the right direction, you need to take the area to the left and subtract it from 1 to leave us with the remaining area to the right. Note that if the hypothesis test were two sided, you would multiply that tail area by 2 since you need to find two areas under the curve.
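
The same calculation can be mirrored in Python before applying it to the data; in this sketch, group1 and group2 are assumed to already hold the prices for grades 7-13 and grades 1-6 respectively:

```python
import numpy as np
from scipy import stats

# Sample means, sample standard deviations (like STDEV.S), and group sizes.
x1, x2 = group1.mean(), group2.mean()
s1, s2 = group1.std(ddof=1), group2.std(ddof=1)
n1, n2 = len(group1), len(group2)

t_star = (x1 - x2) / np.sqrt(s1**2 / n1 + s2**2 / n2)
df_ = (n1 + n2) - 2                    # degrees of freedom as defined above
p_value = stats.t.sf(t_star, df_)      # area to the RIGHT of t*
print(t_star, p_value)
```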

Here is the process applied to the example data set:

  1. Test Statistic
    1. t* = (541096.73 – 531672.72) / 8102.85
    2. t* = 1.1631
  2. P-Value 
    1. df = 21613 – 2 = 21611
    2. p-value = 1-T.DIST(1.1631, 21611, 1)
    3. p-value = 0.12

When following this procedure, the p-value works out to be 0.12. Since this is greater than the standard cut off of 0.05, we would fail to reject the null hypothesis. This does not mean that the alternative hypothesis is to be rejected; it means that there is simply not enough evidence to support the claim that houses with grades 7-13 are priced higher than houses with grades 1-6.

However, it should be noted that this does not express the magnitude of the difference, or lack of difference, in the prices. This is where practical significance comes in. In order to see the actual magnitude of the difference in prices, a confidence interval will be calculated. Essentially, this shows us how different these prices actually are, and if they are subjectively different enough to be noteworthy.

Explanation of Confidence Intervals

Confidence intervals are ranges in which the true population parameter lies. These complement hypothesis tests perfectly; the hypothesis test will indicate whether or not there is statistical significance, and then the confidence interval will give the magnitude of this significance. For example, if there is a statistical significance between the price of pencils sold at Walmart versus Target, calculating a confidence interval gives a range between which the true difference in prices lies. If it is found that the difference in prices is somewhere between $0.10 and $0.25, it can be said that the findings are not practically significant. However, if it is found that the difference in prices is somewhere between $2.00 and $3.00, it can be argued that that demonstrates practical significance and thus relevance for intended audiences such as consumers.

Confidence intervals are conducted at various levels of confidence, indicating the probability that the true population parameter lies within the range. To be more technical, the confidence level says that we have x% confidence that the true population parameter lies within this range. The standard confidence level is 95%, which will be used for this example. When the confidence level is increased, the confidence interval will become wider. When the confidence level is decreased, the confidence interval will become narrower. This will be demonstrated later with the example data with reasoning as to why this logically makes sense.

Here is a visual representation of the task at hand where the red circle represents the multiplier value to first be found:

Here is a brief outline of how to find the confidence interval:

  1. Find the multiplier, t*
    1. This is NOT the same t* that used during hypothesis testing; it is a different multiplier that is found separately
    2. The opposite of the T.DIST function is used, which is T.INV(probability, df). This takes a probability / an area and degrees of freedom, and calculates the corresponding t* value. Referring back up to the figure above, since the t* value to be found includes ALL of the area to the left, the remaining area under the curve needed to be calculated and then added to 0.95. The area under the curve must add up to 1, so therefore the remaining two sections must total 0.05; due to the symmetrical nature of the bell curve, the two areas on either tail are equal, meaning they are each 0.025. The 0.025 is added to 0.95 to get a total area of 0.975; this will be the area used in the function to find t*.
  2. Find the margin of error
    1. Margin of error = (t*) * standard error
      1. Where standard error = √(s₁²/n₁ + s₂²/n₂), combining each group’s standard deviation (s) and sample size (n)
  3. Add and subtract the margin of error to the difference in sample means

Here is the process worked out with the example housing data:

  1. T-Multiplier
    1. t* = T.INV(0.975, 21611)
    2. t* = 1.960074
  2. Margin of Error
    1. Standard error = 8102.85
    2. Margin of error = 1.960074 * 8102.85 = 15882.17
    3. Lower Bound = (541096.73 – 531672.31) – 15882.17 = -$6,457.75
    4. Upper Bound = (541096.73 – 531672.31) + 15882.17 = $25,306.59
    5. Confidence Interval = (-$6,457.75, $25,306.59)
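For those following along in Python, here is a minimal sketch of the same confidence interval calculation; the standard error and group means are taken directly from the worked example above.

```python
# 95% confidence interval for the difference in mean price.
from scipy import stats

mean_diff = 541096.73 - 531672.31   # difference in sample means
se = 8102.85                        # standard error from the worked example
df = 21611

t_star = stats.t.ppf(0.975, df)     # same as T.INV(0.975, 21611) in Excel
moe = t_star * se                   # margin of error
print(mean_diff - moe, mean_diff + moe)   # roughly (-6457.75, 25306.59)
```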

When going through this process with the data set, it indicates that the difference between the two groups of homes is somewhere between -$6,457.75 and $25,306.59, where a negative number indicates that group two is actually priced higher than group one. Practical significance is much more subjective than statistical significance, so this price difference may not matter to some people, but it may to others. It should also be noted that the results of the confidence interval actually bolster the results found in the hypothesis test. In the hypothesis test, it was concluded that there is not enough evidence to support the claim that houses with grades 7-13 are priced higher than houses with grades 1-6. Looking at the CI, notice that the interval contains zero, meaning that it is plausible the two groups of homes are in fact priced the same. It also includes negative numbers, indicating that it is possible that houses with grades 1-6 are even priced higher than houses with grades 7-13. This bolsters the findings from the hypothesis test, indicating once again that we should fail to reject the null hypothesis. However, if the CI did not contain zero, the data would be inconsistent with the prices being the same, and the null hypothesis could be rejected. This is another way that confidence intervals are used to complement and bolster the findings in hypothesis testing.

Following the same procedures outlined above, the CI was recalculated with the confidence level decreased to 90% and then increased to 99%:

CONFIDENCE LEVEL | LOWER BOUND | UPPER BOUND
90%  | -$3,904.14  | $22,752.99
95%  | -$6,457.75  | $25,306.59
99%  | -$11,448.97 | $30,297.81

When the confidence level was increased, the CI widened. Intuitively, this makes sense: a wider range contains more possible values, so we can be more confident that the population parameter lies somewhere within it. The underlying reason is that a higher confidence level produces a larger multiplier, and therefore a larger margin of error.
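Here is a short sketch showing how the interval can be recomputed at each confidence level, which makes the widening easy to see; the mean difference and standard error are again reused from the worked example.

```python
# Confidence intervals at 90%, 95%, and 99% confidence.
from scipy import stats

mean_diff = 541096.73 - 531672.31
se = 8102.85
df = 21611

for level in (0.90, 0.95, 0.99):
    t_star = stats.t.ppf(1 - (1 - level) / 2, df)   # two-tailed multiplier
    moe = t_star * se
    print(f"{level:.0%}: ({mean_diff - moe:,.2f}, {mean_diff + moe:,.2f})")
```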

To conclude, it has been identified that there is not a statistically significant difference between the price of homes with grades 7-13 and the price of homes with grades 1-6. Therefore, we fail to reject the null hypothesis, as there is insufficient evidence from the data to support the claim that higher-graded houses are priced higher. We have 95% confidence that the difference in price between the two housing groups is between -$6,457.75 and $25,306.59.

This process is easily repeatable, especially with tools like Excel, and therefore can be used to investigate numerous variables and characteristics in the data. Here is a comprehensive table of the results from the hypothesis testing with the housing data set: 

VARIABLE | SIGNIFICANCE | CONFIDENCE INTERVAL
Waterfront | Statistically significant | (958246.81, 1302378.04)
Renovation* | Statistically significant | (190329.86, 269706.56)
Renovation Year** | Statistically significant | (35718.74, 192890.85)
Condition*** | Not statistically significant | (-5383.70, 15105.12)
Year Built**** | Not statistically significant | (-21513.88, 2102.56)

 

* = We looked at whether houses that have been renovated were priced higher than houses that have not been renovated

**= We looked at whether or not houses renovated in or after 2000 were priced higher than those renovated before 2000

*** = We looked at whether houses with condition grades 4 or 5 were priced higher than those with grades 1, 2, or 3.

**** = We looked at whether houses built in or after 2000 were priced higher than those built before 2000.

 

Thanks for tuning into our discussion on hypothesis testing! We hope you join us next time when we dive into linear regression modeling.

 

Missed the first post in this series? Click here.

Missed the second post in this series? Click here.

Ready for the next post? Click here.

 


Welcome back! We’re glad you made it to Part 2 of this series. Let’s dive right in. 

One of the most important statistical concepts is the distribution curve. A distribution curve is a graph that shows the different frequencies of a certain variable. For example, if I were to make a distribution curve of scores received on a test, the graph would show how many people received each score. The area under a region of the distribution curve gives the probability of observing a value in that region. Going back to our previous example, if I wanted to find the probability of a student receiving a score between 80 and 90, I would find the area under the curve between those two test scores. The total area under the curve is equal to 1. These concepts will be important later when discussing the p-value in hypothesis testing.
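As a quick illustration of the “area under the curve” idea, here is a minimal Python sketch; the mean of 75 and standard deviation of 10 are invented purely for this example.

```python
# Probability of a test score falling between 80 and 90, assuming scores
# are roughly normal with an illustrative mean of 75 and std dev of 10.
from scipy import stats

scores = stats.norm(loc=75, scale=10)
p_between = scores.cdf(90) - scores.cdf(80)   # area under the curve from 80 to 90
print(round(p_between, 3))
```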

In later posts we will frequently be referring to a specific type of distribution called a normal distribution, often referred to as a bell curve. It is referred to as a “bell curve” because it simply looks like a bell! A normal distribution curve is special because it is perfectly symmetrical. 

Sample size actually plays a large role in distribution, especially normal distribution. First, let’s give sample size a brief definition: the number of data points or subjects in your sample, typically denoted as “n.” Another way to think about it is simply how much data you have. For example, if you conducted a survey of 50 students in a high school, your sample size would be 50. The larger the sample size, the more closely the sample’s distribution resembles the population’s distribution, and, as posited by the Central Limit Theorem, the more nearly normal statistics such as the sample mean become. Since many variables are naturally distributed approximately normally, a larger, more representative sample is better at approximating population parameters. This is easier to conceptualize if you can actually see it for yourself. Below we have put together a demonstration that shows how increasing the sample size causes the distribution curve to become more normal and more closely resemble the population’s.

For this demonstration, let’s use our example data set and look at the living room size. Let’s assume that the true population size is 21,613. We began by randomly drawing a sample of x number of living room sizes from the population and then graphed the sampled data in the form of a histogram to show the distribution. Note that the x axis represents the living room size in square feet, and the y axis represents the frequency in which it occurs. 
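For those following along in Python, a sketch of how this demonstration could be reproduced is below; the file and column names (kc_house_data.csv, sqft_living) are assumptions made for illustration.

```python
# Draw progressively larger random samples of living room size and plot
# a histogram of each to watch the distribution become more normal.
import pandas as pd
import matplotlib.pyplot as plt

houses = pd.read_csv("kc_house_data.csv")   # hypothetical file name

for n in (100, 500, 1000, 10000, 20000):
    sample = houses["sqft_living"].sample(n=n, random_state=42)
    plt.hist(sample, bins=50)
    plt.title(f"Living room size, n = {n}")
    plt.xlabel("Square feet")
    plt.ylabel("Frequency")
    plt.show()
```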

SAMPLE SIZE | DISTRIBUTION CURVE
100 | With 100 randomly selected sample data points, our distribution curve has a wide spread and does not resemble a normal distribution curve at all. This indicates that a sample size of 100 is inadequate.
500 | With 500 randomly selected sample data points, the distribution curve has decreased spread and begins to look more normal than the sample of 100. This is a better candidate than the sample of 100, but we can still get a better model by using more data points.
1,000 | With 1,000 randomly selected sample data points, the data decreases its spread once again and becomes approximately more normal. This is a better candidate than a sample of 500, but can become more normal with more data points.
10,000 | With 10,000 randomly selected sample data points, this is where we see a normal distribution curve begin to truly emerge. Since the spread is decreasing, we know that the number of outliers is also decreasing.
20,000 | With 20,000 randomly selected data points, the curve looks about the same as the sample of 10,000, except it looks slightly more normal and has slightly less spread. This makes it a better candidate than 10,000 points, but likely only marginally.
21,613 (full data set) | Here is the distribution curve of the “population” of 21,613 data points. Obviously, this has the most minimal spread and the most normal distribution. The sample of 20,000 points most closely resembles this curve.

 

As we can see, as our sample size increased, the distribution curve resembled our population curve more accurately. Once we gathered a sample size of 10,000, an approximately normal distribution curve emerged, and it improved as we increased the sample size to 20,000. This shows why it is so important to have a large enough sample size if we want to draw valid conclusions about the population from our sample. Something to note, however, is that the sample of 10,000 data points produces a curve that looks very similar to the 20,000 and 21,613 curves; although the 20,000 sample technically best represents the population, we can expect very similar, if not identical, results when cutting that sample size in half to 10,000 since their respective curves are mostly the same. This helps to save both time and resources since you will have to collect significantly less data.

More broadly, it also shows us why we do not necessarily need to use the full population if we want to accurately answer our questions. Why would I spend all of the time and money collecting 21,613 data points when I could reach the same conclusions with only, say, 10,000 data points? This is exactly why we use samples rather than full populations for statistical exploration. As long as it is large enough, a sample will accurately represent our target population.

But how do we know exactly what our sample size should be, and what is deemed large enough? Obviously we want as large a sample size as possible, but due to finite time and resources, it may be necessary to simply find the minimum number of subjects required for valid conclusions about the population. Finding this value depends heavily on how accurately you want your results to reflect the population, which is largely dictated by your margin of error. The smaller your margin of error, the larger your sample size will have to be. After setting a margin of error, you can use the following formulas to calculate the necessary sample size based on the type of method in use:

  • If you are looking at the percentage / proportion of a population that contains a certain characteristic, the required sample size is n = (z*)² × p × q / E², where E is the margin of error:
    • p = the estimated percent of the population that contains the attribute in question 
    • q = 1 – p = the estimated percent of the population that does not contain that attribute
    • z* = z multiplier 
      • ex: 95% confidence level → z* = NORM.S.INV(0.975) = 1.96 in Excel (don’t worry, we go through an example of how to do this step later in the section) 
  • If you are looking at a regression model, this one is a little bit more flexible and subjective
    • Green (1991) → n >= 104 + k
      • k = number of predictors
    • Rule of thumb → 10 subjects for every one predictor 

Note that if the calculated sample size has decimals, it is standard to round up to the nearest whole number, regardless of the value of the decimal. For example, a calculated sample size of 145.129 would be rounded up to 146, even though traditional rounding rules would indicate not to. Also note that margin of error must be expressed in decimal form when used in the formula. (ex: 5% → 0.05)

These formulas might seem like a foreign language if you’re non-technical, but don’t despair! They’re actually much simpler than they look. You have all of the information; you just need to plug it in and voila! All of the variables that you are plugging in are either numbers you already have or numbers you are determining yourself. If you are still a little bit confused, don’t worry! Before going on to the next section, we will run through an example for you. Let’s set the scene: you are trying to find out what percentage of houses in King County, WA have been renovated, but you are unsure how many houses you need to collect data on for accurate results. We will begin by finding each value we need, and then put it all together by plugging the numbers into the formula and finding our sample size. Let’s begin! 

  • p → we can estimate that approximately 50% (0.5) of the houses have been renovated; this is just an estimate, and does not need to be accurate
  • q → 1-p → 1-0.5 = 0.5; we can estimate that approximately 50% (0.5) of the houses have not been renovated; once again, just an estimate, so don’t get too caught up in figuring out this value
  • z* → let’s say we decide we want 95% confidence in our results; we would find our z* multiplier by calculating =NORM.S.INV(0.975) in Excel, which gives us 1.96; 95% is a pretty standard confidence level to use
    • Refer to the Excel Guide for an explanation of why we input 0.975 
  • Margin of error → let’s say we decide we want our results to be within 3 percentage points of the true population parameters, then our margin of error would be 3%, but would be put in as 0.03 in the formula.
  • NOTE: When we set our confidence level and margin of error, we are saying: “We want to have x% confidence that our results are within y percentage points of the true population parameter where x represents the confidence level and y represents the margin of error.”
  • n = 1067.11 → round up → 1068 housing points 
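The same calculation only takes a couple of lines in Python; here is a minimal sketch where the inputs mirror the example above.

```python
# Required sample size for estimating a proportion with 95% confidence
# and a 3% margin of error, assuming p = q = 0.5.
import math
from scipy import stats

p = 0.5
q = 1 - p
margin_of_error = 0.03
z_star = stats.norm.ppf(0.975)   # ~1.96, same as NORM.S.INV(0.975) in Excel

n = (z_star ** 2) * p * q / margin_of_error ** 2
print(n, math.ceil(n))           # ~1067, rounded up to 1068
```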

 

Missed Part 1 of this blog series? Click here.

Ready for Part 3? Click here.


Welcome to data science 101, your guide to important introductory topics in predictive analytics. If you are an analytics specialist looking to learn more about data science, or even just someone interested in learning more about the topic, then this is the series for you!

We want to show you that even if you do not have a strong background in statistics, or do not consider yourself a data scientist, you can still effectively work with and analyze data. Throughout this four-part series, we will go through the following topics: 

  1. Sampling and Distribution (click HERE to read)
  2. Correlation and AB Hypothesis Testing (click HERE to read)
  3. Single and Multiple Linear Regression (click HERE to read)

Throughout this series we will use a real data set to help explain and demonstrate the discussed concepts. Should you choose to follow along with us, we have all of our work available to you in the following links: 

  • For the Excel enthusiasts, click HERE to download the Excel workbook file
  • For Python gurus, click HERE to download a zip file that contains the Jupyter Notebook and necessary csv files. When in Jupyter Notebook, only open the file “Blog Post Data Set and Tests (Code).ipynb” to access the notebook with the code. The other files within the folder are just csv files of the data to be read in the notebook. 

All of the calculations were done in both Excel and Jupyter Notebook, but all of our explanations are done using Excel. The Jupyter Notebook file will show you how we coded the same tests as well as the graphics and figures using Python. 

Before jumping into any tests or analysis, we first go through how to approach your raw data set. We have a lot to get through, so let’s get started!   

The Basics: How to Look at Your Data

Before we begin any analysis, it is important to first assess the raw data. Ensuring you have an in-depth understanding of your data allows you to better manipulate it and draw better conclusions. When collecting or selecting data, it is important to keep two overarching questions in mind: 

  1. Who is my target audience?
  2. What question am I trying to answer?

When thinking of your target audience, you want to make sure that your data is applicable and appropriate for that population. For example, if you were exploring housing prices for an intended US audience, it would be more beneficial to find data for houses in the United States rather than in Europe or Canada. When thinking of your question, you must ensure that your data is actually viable for answering it. Think about the type of data you are looking for, whether it be qualitative or quantitative. For example, if you were looking at tumor sizes and dimensions, quantitative data would be more advantageous to use, but if you were looking at tumor attributes, such as shape and texture, qualitative data would be better to look at. In addition, it is important to gather any and all information that you may have about the data set. For example: 

  1. Where was the data collected from?
  2. Who collected the data?
  3. When was the data collected?
  4. How large is the data set?

Some of these questions may be more relevant than others, especially depending on whether your data is primary, meaning it was collected yourself firsthand, or secondary, meaning it was collected by someone else that you intend to use secondhand. The goal of data analytics is to answer a question, and that begins with information outside of just the numbers. 

Throughout this series, we will apply these thought processes and methods to a real, secondary data set pertaining to housing prices. This data set contains a sample of 21,613 houses describing their price, number of bedrooms, number of bathrooms, number of floors, grade, condition, size of the living room, size of lot, square footage of the house above ground level, size of the basement, the year it was renovated, and the year it was built. The data was collected from King County in Washington in 2015. 

We are looking to answer the question: what characteristics of a home affect its price? This is an extremely broad question that will require multiple methodologies to reap the answers we want. In the next section, we will briefly discuss key statistical terms and methodologies so you can better understand the content that follows.

 

Key Statistical Terms and Definitions

Before we dive in, let’s talk definitions. Just like any other field of study, statistics has its own language, or jargon. If you aren’t a native statistics speaker, understanding this language may seem a little bit confusing, but hang with us! Throughout the series we will ensure that you can decode some of the tricky statistics language that we use, starting with this brief section where we define some critical terms and topics that you will hear frequently throughout. If you ever get confused in later sections, just pop back up here and read through our explanations. Here is a list of some of the important terms we will be using throughout this series: 

  • Primary Data → Data that has been collected by you, firsthand
    • Ex: Surveys, interviews, experiments, focus groups, etc.
  • Secondary Data → Data that has been collected or produced by someone else and used secondhand by someone other than the original collector
    • Ex: Information libraries, public government data, population census, etc. 
  • Statistical Significance → In essence, this just means that something is of importance and worth investigating. If something is statistically significant, to be a bit more technical, that means that the relationship between two variables is likely not due to chance alone and can be attributed to something else.
  • Practical Significance → Tells us how applicable or relevant our findings actually are; shows the magnitude of our statistical significance
    • P-value → p-value, or probability value, is possibly one of the most important statistical terms to be discussed. This metric places a numerical value on statistical significance. There are different p-value cut-offs that can be used, but it is standard to say that a p-value less than or equal to 0.05 indicates statistical significance (we will be using this cutoff throughout the remainder of the series). 
  • Discrete Variable → A variable that can only take on a countable number of distinct values
    •  Ex: Number of rooms, number of floors, number of pets
  • Continuous Variable → A variable that can take on an infinite number of values
    •  Ex: Height, square footage of a house, weight
  • Hypothesis Testing → Just like a science experiment, in hypothesis testing, we are simply trying to test a hypothesis or a “prediction” that we may have. In hypothesis testing, we will form two types of hypotheses: 
  • Null Hypothesis → Put simply, this is our hypothesis, or statement, that there is nothing going on between our two variables.
    •  Ex: There is no statistically significant difference between the price of pencils at Target and the price of pencils at Walmart 
  • Alternative Hypothesis → This is the claim that we are trying to prove with the hypothesis test; once again, put simply, this means that there is, in fact, something going on between our two variables worth noting 
    •  Ex: There is a statistically significant difference between the price of pencils at Target and the price of pencils at Walmart
    •  Note that all hypothesis testing is done under the assumption that the null hypothesis is true
  • Dependent Variable → Also known as the “y” variable
  • Independent Variable → Also known as the “x” variable; the variable we think has some effect on the dependent variable
  • Linear Regression → This is a commonly used methodology to see if one variable can be used to predict another. There are two types of modeling methods: 
  • Single Linear Regression → Seeing if one, independent variable, or predictor, is good at predicting a dependent variable
    •  Ex: Is an SAT score a good predictor of GPA?
  • Multiple Linear Regression → Seeing if multiple independent variables / predictors have a relationship with a dependent variable
    •  Ex: Are AP, SAT, and ACT scores good predictors of GPA?
  • Correlation → The relationship between variables, typically denoted as r; r is between -1 and 1, with -1 being perfectly negatively correlated and 1 being perfectly positively correlated. The closer the r value is to |1|, the stronger the association. If your r value is 0, there is no linear correlation between the two variables, so the closer the r value is to 0, the weaker the association. 
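As a quick illustration of that last definition, here is a tiny sketch of computing r with pandas; the file name and column names (price, sqft_living) are assumptions based on the housing data set described above.

```python
# Pearson correlation between price and living room size.
import pandas as pd

houses = pd.read_csv("kc_house_data.csv")          # hypothetical file name
r = houses["price"].corr(houses["sqft_living"])    # pandas computes Pearson's r by default
print(round(r, 3))
```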

 

If you’re ever confused about certain terminology used throughout, you can jump back up here for a quick refresher!

Can’t get enough? Well neither can we! In our next installment of this series, we will dive into the importance of the distribution curve and sample size, two concepts that are imperative for setting the stage for most of our consequent statistical testing. Click here to read. 

 


PostgreSQL 11 became available on Amazon Relational Database Service (RDS) in March. Have you tried it? We have, and we are here to report all of the awesome enhancements. As a preview, there are major improvements to the table partitioning system, added support for stored procedures capable of transaction management, improved query parallelism, parallelized data definition capabilities, and just-in-time (JIT) compilation for accelerating the execution of expressions in queries. We’ll now go more in depth about each of these improvements, and by the end of this, trust me, you’ll want to go give it a try!

Improvements to partitioning functionality

  • Partitions can now be created by hashing a key column (hash partitioning)
  • Support for PRIMARY KEY, FOREIGN KEY, and indexes on partitioned tables
  • Partitioned tables can have a “default” partition to store data that does not match any of the other defined partitions
  • On UPDATES, rows are moved to appropriate partitions if partition key column data changes
  • Faster partition elimination during query processing and execution speeds up SELECT queries
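To make the first bullet concrete, here is a minimal, hypothetical sketch of hash partitioning run from Python with psycopg2; the table and column names are invented for illustration.

```python
# Create a table partitioned by hashing customer_id into four partitions.
import psycopg2

conn = psycopg2.connect("dbname=demo user=postgres")   # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE orders (
            order_id    bigint,
            customer_id bigint,
            total       numeric
        ) PARTITION BY HASH (customer_id);
    """)
    # Rows are routed to a partition based on the hash of customer_id.
    for remainder in range(4):
        cur.execute(f"""
            CREATE TABLE orders_p{remainder}
            PARTITION OF orders
            FOR VALUES WITH (MODULUS 4, REMAINDER {remainder});
        """)
conn.close()
```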

Lightweight and Fast ALTER TABLE for NOT NULL Column with DEFAULT Values

  • With this new version, ALTER TABLE doesn’t do a table rewrite when adding a column with a non-null default value. This significantly helps when altering tables with millions of records where a new column is added with a default value.
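For instance, a statement like the one in this hypothetical psycopg2 sketch (table and column names invented) now completes almost instantly instead of rewriting the whole table.

```python
# Add a NOT NULL column with a default; no full table rewrite in PostgreSQL 11.
import psycopg2

conn = psycopg2.connect("dbname=demo user=postgres")   # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'new';")
conn.close()
```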

Stored Procedures with Transaction Control

  • Finally, Postgres 11 supports creating stored procedures. Prior versions of Postgres supported functions; however, functions cannot manage transactions. Within a stored procedure, you can now COMMIT and ROLLBACK transactions.
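Here is a minimal, hypothetical sketch of a procedure that commits inside its own body; all names are illustrative, and the CALL runs in autocommit mode because transaction control inside a procedure requires that it not be invoked from within an outer transaction.

```python
# Create and call a stored procedure that manages its own transaction.
import psycopg2

conn = psycopg2.connect("dbname=demo user=postgres")   # placeholder connection string
conn.autocommit = True    # CALL must not run inside an outer transaction block
cur = conn.cursor()

cur.execute("""
    CREATE PROCEDURE archive_closed_orders() LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO orders_archive SELECT * FROM orders WHERE closed;
        DELETE FROM orders WHERE closed;
        COMMIT;   -- transaction control inside a procedure, new in PostgreSQL 11
    END $$;
""")
cur.execute("CALL archive_closed_orders();")

cur.close()
conn.close()
```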

Improvements to Parallelism

  • CREATE INDEX can now use parallel processing while building a B-tree index
  • Parallelization is now possible in CREATE TABLE…AS, CREATE MATERIALIZED VIEW, and certain queries using UNION
  • Hash joins performed in parallel
  • Improvements to partition scans to more efficiently use parallel workers
  • Sequential scans now perform better with many parallel workers

Optimizer

  • Selection of the most common values (MCVs) has been improved. MCVs were previously chosen based on their frequency compared to all column values; in Postgres 11, MCVs are chosen based on their frequency compared to non-MCV values
  • Selectivity estimates for >= and <= have been improved, which improves performance when using BETWEEN
  • Improvements to the optimizer row counts for EXISTS and NOT EXISTS queries

Optimal Just-in-Time (JIT) Compilation

  • Just-in-Time (JIT) compilation is the process of turning some form of interpreted program evaluation into a native program, and doing so at run time. JIT would be beneficial for CPU bound queries. JIT currently aims to optimize two essential parts of query execution: expression evaluation and tuple deforming.

Expression evaluation is used to evaluate WHERE clauses, target lists, aggregates, and projections. It can be accelerated by generating code specific to each case.

Tuple deforming is the process of transforming an on-disk tuple into its in-memory representation. It can be accelerated by creating a function specific to the table layout and the number of columns to be extracted.
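If you want to see JIT in action, a sketch like the following (run via psycopg2, with a hypothetical table name) enables it for the session and surfaces the JIT details in the query plan when the planner decides compilation is worthwhile.

```python
# Enable JIT for the session and inspect the plan of an expression-heavy query.
import psycopg2

conn = psycopg2.connect("dbname=demo user=postgres")   # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("SET jit = on;")
    cur.execute("EXPLAIN (ANALYZE) SELECT sum(total) FROM orders WHERE total > 100;")
    for (line,) in cur.fetchall():
        print(line)   # a "JIT:" section appears when compilation was triggered
conn.close()
```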

I know you won’t believe it, but these aren’t even all of the benefits of the new PostgreSQL 11. There are also many improvements to window functions, indexes, and monitoring that you may find greatly beneficial. If that doesn’t get you excited, I don’t know what will! The best way to use PostgreSQL 11 is with Amazon RDS. Reach out to our team if you’d like to get started with AWS or want to unlock the full potential of your current environment!


Let’s admit it – managing licenses is difficult. This complex process often involves manual or ad-hoc reporting that can quickly become outdated or result in inaccuracies. Within AWS, licenses are used across a variety of tools, which only makes the situation worse. We’ve heard this complaint many times from our customer base, and decided it was time to introduce a solution: AWS License Manager!

 

This service is available to all AWS customers and provides an easy way to manage licenses in AWS and on-premises servers from software vendors like Microsoft, SAP, Oracle, and IBM. Here are four reasons why you should take advantage of this service:

 

 

  1. It’s Simple

AWS License Manager gives you a single, centralized view that allows tracking of all the licenses across AWS and on-premises. You can track how many licenses are being used, how many are available, and how many have breached limits – all on the built-in dashboard. AWS License Manager integrates with AWS services to simplify management of licenses across multiple AWS accounts, IT catalogs, and on-premises from one AWS account.

  2. You Have Control

As an Administrator, you can create your own custom licensing rules that fit the terms of your licensing agreements, giving you control over license usage. These rules can be made centrally or you can specify different rules for various groups in your organization. These rules will be enforced when an EC2 instance is launched. With AWS License Manager, you have visibility over how software licenses are used and can prevent misuse before it happens.
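As a rough illustration of how a rule gets attached to a pool of licenses, here is a hypothetical boto3 sketch; the configuration name, counts, and counting type below are invented for this example.

```python
# Define a license configuration that counts vCPUs and enforces a hard limit.
import boto3

client = boto3.client("license-manager")

response = client.create_license_configuration(
    Name="example-sql-server-licenses",            # hypothetical name
    Description="Tracks SQL Server vCPU licenses",
    LicenseCountingType="vCPU",                    # count consumption by vCPU
    LicenseCount=96,                               # total licenses owned (example value)
    LicenseCountHardLimit=True,                    # block launches that would exceed the count
)
print(response["LicenseConfigurationArn"])
```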

  3. Lower Costs

Say goodbye to wondering if the right number of licenses are being used or worrying if additional licenses are required! AWS License Manager does all of this for you, saving you the time and costs of tracking and managing licenses. You can also enforce controls on software usage to reduce the chance of overages.

  4. Reduced Risk of Violations

The consolidated view of your licenses reduces the risk of non-compliance. Additionally, the rules administrators set can limit violations such as using more licenses than an agreement stipulates or reassigning licenses to different servers on a short-term basis. It’s possible to limit a licensing breach by stopping the instance from launching or by automatically notifying the administrators about the infringement.

Well, there you have it – four reasons why you should use AWS License Manager and 0 reasons why you shouldn’t (because they don’t exist)! Do yourself a favor and start using this service to keep you compliant and to save you time, effort, and money.

If you have issues with set-up or have questions about the service, feel free to contact us!


It’s been over 11 years since AWS began supporting Microsoft Windows workloads. In that time, AWS has innovated constantly to maintain its title as the #1 cloud provider for these workloads. You can run the full Windows Stack on AWS, including Active Directory, SQL Server, and System Center.

Many third parties have completed studies that show why AWS is superior when it comes to performance, cost, and reliability. In 2018, the next-largest cloud provider had almost 7x more downtime hours than AWS. Additionally, SQL Server on AWS boasts 2-3x better performance. When costs are calculated correctly, SQL Server running on AWS’s competitor’s platform would cost almost twice as much. This includes the cost of storage, compute, and networking.

Reliability is the quality that puts AWS high above the rest. AWS has 64 availability zones within 21 different regions. AWS customers can deploy their applications across multiple zones in the same region for fault tolerance and latency. Instead of having a single-region instance that scales up, AWS’s services are divided into smaller cells that scale out within a region. This design reduces the effects when a cell-level failure occurs. Notably, AWS has never experienced a network event that spans multiple regions.

When migrating your SQL Server workloads to AWS, there are a few things you should consider. It’s important to optimize your total cost of ownership, which includes optimizing your workloads to benefit from the scalability and flexibility of the cloud. On-premises servers are rarely optimized; in fact, 84% of workloads are over-provisioned. Many Windows and SQL Server 2008 workloads are running on older and slower server hardware. To optimize your cloud migration, you need to size your workloads for performance and capability, not by physical servers. To reduce cost, you can also decrease the number of licenses that you use by reducing server and core counts.

Another strategy is to decouple your storage and compute processes. When these are combined, they must be scaled together. On the cloud, compute and storage can be separated. Decoupling makes elasticity easier to achieve and manage. Many people question this because SQL Server instances often contain logic to ingest or process data before it is stored in a schema. Many times, ETL logic is written within the SQL Server processing engine and the servers are sized to handle a large volume of ETL processes. These ETL processes often run a couple of times a day, and the capacity is only needed while they are running. By moving the ETL logic outside of the SQL engine, you can utilize the elasticity of the cloud and expand your compute power whenever needed. This will reduce your SQL Server footprint in the long run. Of course, this doesn’t apply to every use case in SQL Server, but if you have ETL, enrichment, or load logic inside your SQL Server, decoupling might be the correct choice. This was the case with one of our customers, Telarix. See here to read their white paper.

As part of your migration, you should consider running your SQL Server on a Linux instance within the AWS platform. The majority of SQL Server functionality available on Windows is also supported on Linux. Additionally, there is a minimum of a 20% cost benefit to running SQL Server on a Linux instance! This decision will give you the best performance and save you the most money.

You can also use Amazon RDS to upgrade your SQL Server database instance to a newer version. This is performed in place and is initiated with just a couple of clicks. Before you upgrade, you can create a snapshot backup, use it to create a test DB instance, then upgrade that instance to the desired new version. You can also opt in to automatic upgrades that take place within your preferred maintenance window. 
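As a rough sketch of that snapshot-then-upgrade flow using boto3 (the instance identifiers and target engine version below are placeholders):

```python
# Snapshot a SQL Server RDS instance, restore a test copy, then upgrade it.
import boto3

rds = boto3.client("rds")

# 1. Take a snapshot of the existing instance.
rds.create_db_snapshot(
    DBInstanceIdentifier="prod-sqlserver",            # placeholder identifier
    DBSnapshotIdentifier="prod-sqlserver-pre-upgrade",
)

# 2. Restore the snapshot into a separate test instance.
#    (In practice, wait for the snapshot to become available first.)
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="test-sqlserver-upgrade",
    DBSnapshotIdentifier="prod-sqlserver-pre-upgrade",
)

# 3. Upgrade the test instance to the desired engine version.
rds.modify_db_instance(
    DBInstanceIdentifier="test-sqlserver-upgrade",
    EngineVersion="15.00.4236.7.v1",                  # example target version
    AllowMajorVersionUpgrade=True,
    ApplyImmediately=True,
)
```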

If you’re considering migrating your Windows Workloads to the cloud, AWS is the optimal choice because of the price, performance, and reliability. This is the perfect time to migrate and modernize your outdated servers. Contact us to learn more or get started on your project.
