# August 2019

Welcome back to our fourth and final installment of our series! In this post, we will be discussing single and multiple linear regression, so let’s jump right into it.

## Making Predictions Using Single Linear Regression

Linear regression is a statistical modeling tool that we can use to predict one variable using another. This is a particularly useful tool for predictive modeling and forecasting, providing insight into present data and predictions about future data. The goal of linear regression is to create a line of best fit that can predict the dependent variable with an independent variable while minimizing the squared error. That was a pretty technical explanation, so let’s simplify. We are trying to find a line that we can use to predict one variable by using another, all while minimizing error in this prediction. In order for this line to be helpful, we need to find the equation of the line, which is as follows:

ŷ = b₁x + b₀

• ŷ → the predicted dependent variable
• b₁ → the slope of the line
• x → the independent variable, aka the variable we are using to predict y
• b₀ → the intercept of the line

This equation may look familiar, and it should. It is the same equation as your standard y= mx + b that you learned back in Algebra I, just written a little differently in statistics language.

Some of these concepts are difficult to understand on their own, so let’s apply them to our example housing data set to see them in action.

When we calculated correlation, we found a pretty strong correlation between the size of the living room and the price. Now that we know that there is something going on between those variables, we want to see if the living room size can be used to predict the price of our house. Before we jump into calculations, we first want to make sure that we can actually use the linear regression methodology; this means we need to check to see if the relationship between our living room size and price is approximately linear.

On the left we have all of our data points scattered, and on the right we have the same graph except that the line of best fit we are trying to find is included as well:

When we scatter our data, we find that the relationship between the living room size and the price is approximately linear, meaning we now have the thumbs up to begin our linear regression.

Testing single linear regression can be done by hand, but it is much easier and quicker to use tools like Excel or Jupyter Notebook to create predictive models. In order to conduct a regression analysis in Excel, you need to make sure that your Analysis ToolPak is activated. How to activate the ToolPak will depend on what version of Excel you are running, so I would recommend just doing a quick Google search on how to set it up. It is fairly simple, so this shouldn’t take any longer than a minute or two. Once it is activated, you are good to go!

Here is a brief outline of how to conduct your regression analysis using Excel:

1. Select “Data” tab → Select “Data Analysis” → Select “Regression”
2. Input Y Range → select the data for the variable you are trying to predict; dependent variable
1. With our example: Price
3. Input X Range → select the data for the variable you are using to predict the other variable; independent variable
1. With our example: Living room size
4. Select the “Labels” box so that your output sheet will include the corresponding data labels
5. If you want your regression output to be inserted onto a new worksheet page, check the box for “New Worksheet Ply”
6. Check the boxes that say “Residuals” and “Residual Plots;” this will be important later on in our analysis and conclusions

When you press enter, you will be directed to a new workbook sheet that will have a bunch of output tables that may seem overwhelming and confusing, but don’t worry! We are going to show you which numbers are actually important for you to look at.

We have highlighted all of the boxes / numbers that are important to look at. Let’s walk through these individually.

1. R Square → the r square value tells us how good our model is at predicting the dependent variable. The closer the number is to 1, the better our model is. With this output, we see our r square value is 0.4949, which means that 49.49% of our data can be explained by our model.
2. Intercept Coefficient → the intercept coefficient value tells us our b₀ value, which is the intercept of the regression line; this tells us that when the square footage of the living room is zero, the price of the house is -\$43580.74
1. NOTE: The intercept is not always applicable, which is true in our instance since a negative house price is impossible
3. Sqft_living Coefficient → this gives us the slope of our line, which is b₁; the slope tells us that as the square footage of the living room increases by one square foot, the price of the home increases by \$280.62.
4. Sqft_living p-value → this will tell us if the predictor in question, which in our case is the living room size, is a good predictor of our dependent variable. In this instance, with a p-value of approximately zero, we can conclude that the living room size is a good predictor of housing price.
1. Note: You can also look at the Significance F value, but if your predictor is statistically significant, the model will be too and vice versa

Now let’s put it all together to create the equation of our line:

Predicted Price = 280.62(living room size) – 43580.74

Now, ideally, given a living room size, we can make a prediction of the price of any given house.
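As a quick sanity check, the fitted line can be wrapped in a tiny function (coefficients taken from the regression output above):

```python
def predict_price(living_room_sqft: float) -> float:
    """Line of best fit from the single regression output:
    Predicted Price = 280.62(living room size) - 43580.74"""
    return 280.62 * living_room_sqft - 43580.74

print(predict_price(2000))  # prediction for a 2,000 sqft living area
print(predict_price(0))     # the (inapplicable) negative intercept case
```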

Now that we have built our model, it is important to then look at our residuals, otherwise known as errors. The residual / error, typically denoted as “e,” is essentially how far off our prediction was from the actual price. This is given by the following equation:

e = actual – predicted

e = y – ŷ

Visually, if the data were to be scattered and the line of best fit laid on top, then the error values would be the distance from the actual observed value to the line at that specific x value, which would be the predicted y value. Here is a visual to help conceptualize this:

When our error is negative, that means that we overestimated the price; when our error is positive, that means we underestimated the price. When our error is zero, that means our predicted price was accurate with the actual housing price given. The smaller the error, the more accurate our prediction is.
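The sign bookkeeping above is easy to sketch in code; the (actual, predicted) price pairs here are made up for illustration:

```python
# Hypothetical (actual price, predicted price) pairs.
pairs = [(300_000, 310_000), (450_000, 430_000), (250_000, 250_000)]

for actual, predicted in pairs:
    e = actual - predicted  # e = y - y_hat
    if e < 0:
        verdict = "overestimated"   # predicted more than the actual price
    elif e > 0:
        verdict = "underestimated"  # predicted less than the actual price
    else:
        verdict = "exact"
    print(e, verdict)
```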

When looking at error, we want to ensure that we are equally overestimating and underestimating our data, and that the errors are occurring randomly. This is why, when we are constructing our data analysis, we want to include residuals and residual plots. How do we make sure these assumptions are being upheld? Excel will calculate all of our errors and then scatter them on a graph, with the x axis being the independent variable and the y axis being the error value. We will look for the following characteristics on the graph to ensure these assumptions are upheld:

1. The data points are scattered evenly around y = 0
1. Equally over and underestimating the dependent variable
2. The residual plot follows no pattern
1. Errors are occurring randomly

Here is an example of a residual plot that upholds these assumptions relatively well:

Here is the residual plot for our single linear regression analysis with living room size:

Looking at the plot, we can see the residuals are approximately scattered around zero, meaning we are equally over and underestimating the price. However, we do notice a pattern, meaning the errors are not necessarily occurring randomly. Looking at the graph, we notice a horn-shaped pattern, meaning that as the living room size increases, so does the error. This indicates a heteroscedasticity problem, which is a much simpler concept than the word is to pronounce. All this means is that the residuals are occurring unevenly, resulting in the pattern we see above. In order to uphold the regression assumptions, we would want to see homoscedasticity, meaning that the residuals are occurring evenly throughout. This breach of the assumption is very important to note, because it means that the living room size may not be a good candidate for predicting house price. This means we should look more into other predictors and find a better candidate. Let’s take a look at the basement size.

Here we see that the p-value for basement size as a predictor and the F value for the model are both zero, meaning that the model is a good predictor of price. Moving to the r square value, we find that it is less than that of the living room model; this means that the model with the basement size explains much less of the data than that of the living room size. However, the issue that we came across with living room size lied in the residual plot, so let’s take a look at that.

We first notice that the residuals are evenly scattered around zero. Then we notice that, unlike the living room size residuals, there is no apparent pattern in the plot. This means that there is homoscedasticity, and therefore the basement size is a better candidate for predicting price than the living room size is.

Once again, this process is easily repeatable due to tools such as Excel. We have gone ahead and tested the remaining viable predictors and created a comprehensive table of our findings:

| PREDICTOR | STATISTICALLY SIGNIFICANT? (Y/N) | RESIDUAL PLOT |
| --- | --- | --- |
| Living Room Size (in sqft) | Yes | Heteroscedastic |
| Living Room Size 2015 | Yes | Homoscedastic |
| Basement (in sqft) | Yes | Homoscedastic |
| Lot Size (in sqft) | Yes | Heteroscedastic |
| Lot Size in 2015 | Yes | Heteroscedastic |
| Square footage of the house above ground | Yes | Heteroscedastic |

Of the single-predictor models, we can see that the basement model and the living room size in 2015 model uphold the assumptions the best out of all of the other predictors. However, it should be noted that both r square values were modest, so we need to see if there is a better model out there. We know that one variable likely is not able to be the sole predictor of housing price, so what if we combine multiple predictors into one model? That is where multiple regression comes in.

## More Predictive Modeling with Multiple Linear Regression

We are going to build off of single linear regression, so if you are still confused about that, it may be beneficial to go back through and review the previous section until you feel comfortable with the concepts. If you are good to go, then let’s venture on to multiple linear regression!

Multiple linear regression is very similar to single linear regression except it is a little bit more complex. Instead of looking at just one predictor, we are now going to incorporate multiple predictors into our model to see how good they are at predicting price together. So with our example, instead of looking at just one predictor of housing price at a time, we can create one model that incorporates all of our different predictors to see if we can better approximate the price.

One of the assumptions for multiple linear regression, just like single linear regression, is that the multivariate regression line is approximately linear. However, unlike our single linear regression line, it is near impossible to confirm whether or not we uphold this assumption since our line cannot be conceptualized on a 2-D plane. This does not pose a serious obstacle since deviation from the assumption does not impact our results in a major way. However, it is a good rule of thumb to look at each variable’s linearity individually just to ensure that there is nothing that poses a serious threat to the assumption. When looking at the variables we will be exploring, all of them follow an approximately linear relationship and therefore should uphold our linearity assumption for the multivariate model.

Once again we will be using the Analysis ToolPak in Excel, so make sure it is activated and good to go. Head to the data tab and select data analysis and regression, just like we did with the single linear regression. The only thing that we are going to change here is our “Input X Range” values. Instead of inputting just one predictor, we are going to input all of our different predictors at the same time; in this instance: the living room size, the lot size, and the square footage of the house above ground level. Make sure you select the “Labels,” “Residuals,” and “Residual Plots” boxes as well. Press enter, and you are good to go.

Since multiple linear regression is more complex than single linear regression, it is fitting that there are a few more numbers that we will have to look at when doing our analysis. We went ahead and highlighted them for you:

Once again, let’s go through each of these numbers individually:

1. Adjusted R Square → As you increase the number of predictors in your model, your r square value will artificially inflate. That is why there is an adjusted r square value that accounts for the increase in the number of predictors to give a deflated and more accurate figure.
2. Regression Significance F → This is like a p-value but for our entire model; it will tell us whether or not our model is a good fit for predicting our dependent variable, which in this case is price.
3. Intercept Coefficient → Just like in single linear regression, this will tell us the intercept for our model.
4. Predictor Coefficients → The remaining coefficients for our different predictors will give us the slopes for each independent variable respectively.
5. P-value → This tells us whether or not the variable is good at predicting price given that the other variables are already included in the model.
1. Ex: Living room size is a good predictor of price when lot size and the square footage of the house above ground are already included in the model
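The adjusted r square deflation described above can be computed directly from the plain r square; this is the standard formula, shown here with hypothetical numbers:

```python
def adjusted_r_square(r_square: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1),
    where n is the number of observations and p the number of predictors."""
    return 1 - (1 - r_square) * (n - 1) / (n - p - 1)

# Hypothetical example: R^2 of 0.5 with 100 observations and 3 predictors.
print(adjusted_r_square(0.5, 100, 3))
```

Note that the adjusted value always sits at or below the plain r square, and the gap widens as more predictors are added.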

Now we can put these together to create our equation:

Price = 269.15(living room size) – 0.2782(lot size) – 16.90(sqft above ground) – 41445.124
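Plugging the coefficients from the output into a small function (the input sizes below are arbitrary examples):

```python
def predict_price_multi(living_sqft: float, lot_sqft: float, above_sqft: float) -> float:
    """Multiple regression equation from the output above."""
    return (269.15 * living_sqft
            - 0.2782 * lot_sqft
            - 16.90 * above_sqft
            - 41445.124)

# Example: 2,000 sqft living room, 5,000 sqft lot, 1,500 sqft above ground.
print(predict_price_multi(2000, 5000, 1500))
```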

Now looking at “Significance F,” we see that it is approximately zero. What does this mean? For consistency, let’s keep our significance level, or cut off value, at 0.05. Since our F value is less than our cut off, just like a p-value, we can say that our model is a good fit for our data and for predicting our dependent variable. This is a great tool for comparing different models. For example, suppose we created another model with an F value greater than 0.05; then we would know it is not a good fit for the data, and therefore we should go with our other model since it statistically proves to be a better fit.

But what if there are multiple models that prove to be significant? How do you pick which one is best? This is where it gets to be a bit more subjective and flexible. To do this, you will have to examine and compare your adjusted r square values. The higher your adjusted r square value, the better your model is at explaining the data, and therefore the better it is at predicting our dependent variable. However, if we increase our predictors and we only see a slight increase in the r square value, it technically is a “better model,” but we can conclude that the additional predictors are not really contributing to the model in a significant way since they don’t help explain that much more of the data.

So let’s say that you drop all variables except for one and you find that your r square value is similar to that of the model that includes all of the possible variables, which will be referred to from here on out as the “extended model.” What does this result mean? It means that with one predictor, you are able to explain approximately the same amount of data as you can with the extended model. And what does that mean? It means that your other variables are mostly just deadweight, not contributing anything meaningful to your model. Your model with just one variable is able to do just as good a job as the model with multiple variables. You can then make the model simpler and more concise by dropping variables that don’t significantly impact it. This is what makes F testing flexible; you are able to play around with a multitude of different models and see for yourself which variables are significantly impacting your dependent variable and which aren’t.

We played around with some different models and put them into a comprehensive table for you below:

| PREDICTOR(S) | SIGNIFICANCE | ADJUSTED R SQUARE |
| --- | --- | --- |
| Living room size, lot size, sqft of above ground | Significant | 0.4942 |
| Living room size and lot size | Significant | 0.4938 |
| Living room size and sqft of above ground | Significant | 0.4932 |
| Lot size and sqft of above ground | Significant | 0.3671 |
| Living room size | Significant | 0.4928 |
| Lot Size | Significant | 0.00799 |
| Sqft of above ground | Significant | 0.3667 |

This is a lot of data to digest, so let’s break it down so it is a bit easier to comprehend. We don’t need to bother looking at the significance since all of the tested models proved to be statistically significant. Does that mean that they are all good models? Technically yes, but we can pick the best model by looking at our adjusted r square number. We see that the highest r square value belongs to the extended model, the one that includes all of our variables. Does that mean the extended model is the best model? Once again, technically yes, but let’s dig a little deeper. We see that our model that only includes the living room size has an r square value that is within hundredths of the extended model. This means that the model with only living room size can explain nearly the same amount of data as the extended model. What can we do with this information? We can conclude that the lot size and square footage of the house above ground level are not really contributing that much, since they are not really helping to explain more of the data, so we can drop them from our model and get almost identical results. It can’t hurt per se to use our extended model, but using just the one variable makes our model more concise and easier to work with, leaving less room for errors and complications.

How do we know when adding a predictor to our model is useful? Let’s compare our model that includes only the lot size to our model that includes both lot size and living room size. When we add the living room size to the lot size model, our new model is now able to explain 48.581% more of the data. That’s a lot! This shows us that adding the living room size to our lot size model is extremely valuable, since our r square value, and thus our ability to explain the data with our model, has increased drastically. We also see our model explain 12.71% more of the data when we add living room size to the lot size and square footage of the house above ground model, which is modest but nonetheless impactful and noteworthy.
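Those percentages are simply differences of the adjusted r square values from the comparison table, expressed in percentage points:

```python
# Adjusted r square values from the comparison table.
lot_only = 0.00799
living_plus_lot = 0.4938
lot_plus_above = 0.3671
extended = 0.4942

# Gain from adding living room size to the lot-size-only model:
print(round((living_plus_lot - lot_only) * 100, 3))  # percentage points gained

# Gain from adding living room size to the lot size + sqft above ground model:
print(round((extended - lot_plus_above) * 100, 2))
```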

Thanks for sticking with us through the series! We hope that you have taken something away from these posts that can help you to better look at and analyze your data. We also hope that we have demonstrated to you that you don’t need to be a data scientist or a statistics expert to turn your data into insight.

For this blog series we ran quite a few numbers and tests in both Excel and Jupyter Notebook, so once again, if you are interested in seeing exactly what went into the calculations and graph / figure creation, follow the links below to be directed to all of the work we did with our data. Within the Excel workbook you will find lots of data and lots of tests. The first worksheet includes all of the data as well as some relevant information about the set and decoder keys. All subsequent pages are various tests that were conducted and reported throughout this series. Within the Jupyter Notebook file you will also see the data uploaded from a csv file into the worksheet. Python was then used to create the sample size demonstration and all of the graphs / figures you see throughout. In addition, it demonstrates some alternative methods for calculating correlations, line parameters, and errors that we used to check our numbers against the Excel calculations. The functions that we have defined make calculations and testing highly repeatable; we would recommend using them with your own data set for those of you who are interested in using Jupyter Notebook instead. Do note, however, that the results in both Excel and Jupyter Notebook are identical, and neither is more accurate or beneficial than the other. This is simply a way to show how you can perform these analyses on either platform depending on your skillset and knowledge.

Link HERE for Jupyter Notebook file

## Microsoft Excel Guide

• =AVERAGE(array)
  • Takes the average of the array
• =STDEV.S(array)
  • Standard deviation of the sample
• =CORREL(array1, array2)
  • Finds the correlation between data array 1 and data array 2
• =Z.DIST(z*, 1)
  • Finds the area to the LEFT of the z* value
  • Use “Z” when you are working with a population
• =Z.INV(probability)
  • Finds the z* value for the corresponding probability value
• =T.DIST(t*, df, 1)
  • Finds the area to the LEFT of the t* value
  • Use “T” when you are working with a sample
• =T.INV(probability, df)
  • Finds the t* value for the corresponding probability value

## A Note About the T.INV and Z.INV Functions

We used these functions to find t* and z* multipliers for confidence levels / intervals. When we conducted a confidence interval at 95% confidence, we did not input 0.95 into the T.INV or Z.INV function, but rather 0.975. Here is an explanation as to how to find what value to input into these functions:

The figure below exhibits the task at hand when finding a t* or z* value at a certain confidence level. The blue area represents our confidence level, which is also a probability value. What we want to do is find the t* or z* value that corresponds with this probability value, as represented by the red circle.

In this example, we are finding a 95% confidence level, and therefore are looking at a probability value of 0.95. When using the T.INV or Z.INV function, we need to use ALL of the area to the left of the t* or z* value that we are trying to find; that means that we need to find the remaining area under the curve.

We know that the area under the curve must total 1, so the remaining area must equal 0.05 (1 − 0.95). We also know that since we are working with a normal distribution, the curve is perfectly symmetrical. This means that the remaining areas in the two tails of the curve are equal. If the two remaining sections must total 0.05 and there are two tails, we know that each area must be 0.025 (0.05 / 2 = 0.025). Now we add the left tail’s area to the 0.95 to find the cumulative area to the left of our multiplier, leaving us with 0.975 (0.95 + 0.025). The 0.975 value is what we input into our Excel function to find the multiplier value.
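The same bookkeeping can be sketched with Python’s standard library, where NormalDist().inv_cdf plays the role of the inverse function for the normal curve:

```python
from statistics import NormalDist

def inverse_input(confidence: float) -> float:
    """Cumulative left-tail area to feed the inverse function:
    the confidence level plus half of the leftover area."""
    return confidence + (1 - confidence) / 2

p = inverse_input(0.95)           # 0.95 + 0.025 = 0.975
z_star = NormalDist().inv_cdf(p)  # z* multiplier for 95% confidence
print(p, z_star)
```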

As you conduct these tests more frequently, you will naturally memorize what value to input into your functions at different confidence levels. However, it may be difficult at first, so we recommend drawing out the curve, the area, and the multiplier you are finding to help visualize exactly what is being calculated and what needs to be input.


In our introductory blog piece, we discussed some high level statistical concepts which you can read HERE if you haven’t already or need a quick refresher. Now, we can finally start to dive more into the data and start running some tests.

## Correlation: A First Clue for a Relationship

A good place to start data analysis is by looking at correlation. Correlation, typically denoted as r in statistics language, measures the relationship between the independent variable, x, and the dependent variable, y, and always lies between -1 and 1. Put even more simply, correlation tells us if there is a relationship occurring between two variables.

This is an important statistical concept, because it sets the stage for further investigation with our data. Finding the correlation between two values is our first clue that a relationship either exists, or does not exist, between the two variables. Using our knowledge about the strength of the relationship, we can then proceed to find out how good one is at predicting the other using linear regression methods. Essentially, correlation is the groundwork that allows us to determine which relationships are worth further investigation. Our correlation value can provide valuable insight on the strength and the direction of the relationship.

Below is a comprehensive table that you can use to decode your correlation value:

| CHARACTERISTIC | INTERPRETATION |
| --- | --- |
| Positive correlation value | Positive direction / association; as one variable increases, so does the other |
| Negative correlation value | Negative direction / association; as one variable increases, the other decreases |
| \|r\| is closer to 1 | The closer \|r\| is to one, the stronger the association |
| \|r\| closer to 0 | The closer \|r\| is to zero, the weaker the association |

Finding our correlation value, and many other statistical measures, is made easy in Excel. The function you will use to find r is =CORREL(array1, array2). Array 1 will simply be all of the data for your first variable, and array 2 is all of the data for your second variable; the order of the arrays does not matter. For example, if I were finding the correlation between GPA and SAT scores, I would first select all of my GPA data points, separate them from the second array with a comma, and then select all of my SAT data points. Then you press enter and voila! You have your correlation value.

Going back to our example data set, we have created a correlation table that shows the correlation between all of our different variables. A heat map has been included on top to help visually show which have strong or weak associations and which have positive or negative associations. The more saturated the coloring of the box, the stronger the association. If the box is more red, the association is positive, and if the box is more blue, the association is negative. For example, a dark red box would indicate a strong, positive correlation, or a lighter blue would indicate a weak, negative correlation.

Perhaps the most important thing to take away from this section is that correlation does NOT imply causation. Two variables may be correlated, but that does not mean that one causes the other; we cannot emphasize this enough. There may be confounding variables, which are variables that are not accounted for, that are the true cause of the phenomena we see. For example, let’s say you find there is a correlation between water consumption and the frequency of illness. It would be absolutely ludicrous to draw the conclusion that drinking water causes you to be ill! Instead, we would use this finding of correlation to indicate that there is likely something else going on, such as the presence of cholera. While there is a correlation between the water consumption and illness, we know that the consumption of water isn’t truly what is causing the illnesses; it is the infectious bacteria contaminating the water that is causing people to become ill. This is why it is important to discern the difference between correlation and causation so you do not inaccurately report findings. Statistics lays the groundwork for your analysis, but proper research is necessary to really get to the true root of the phenomena.

## AB Hypothesis Testing: A Statistical Experiment

Think of the science experiments that you used to do back in school. You pose a question that you are trying to answer; for example, how does sunlight affect plant growth? Then, you make an educated guess, or a hypothesis, as to what you think the results would be, and test whether or not your guess is correct. Statistical hypothesis testing is exactly the same; essentially, you think that there is some kind of relationship occurring between two variables, so you run a statistical experiment to see if your guess, or hypothesis, is correct.

This process begins by looking at the correlation value, which will indicate whether or not there are variables worth investigating. For example, there was a high correlation between the grade, the size of the living room, and the square footage of the house above ground. Since the square footage of the house above ground and the size of the living room can take on an infinite number of values, that makes them continuous variables; this means that it would be advantageous to use linear regression to look deeper into their relationship with price, and even investigate if they can be used to predict it. However, the housing grade is a discrete variable, since there are a finite number of values that this variable can take. You cannot receive a housing grade of 2.5, only 2 or 3. Since grade is a discrete variable, it cannot be used in a linear regression model and would be tested best with the AB hypothesis testing method.

From here, we are going to make a claim that we are trying to prove with the data; this is going to be the alternative hypothesis. However, a null hypothesis also needs to be set, which is a statement that essentially says that our inquiry was wrong, and the data does not support the claim made by the alternative hypothesis. However, the claim cannot be verified by just looking at the raw data; we will need a metric. The metric that is used to numerically show statistical significance is the p-value. If you are a statistics junkie, read our “For Statistics Junkies” section to learn more about the concepts behind the p-value and how it shows us significance; if not, you can go ahead and skip it.

For Statistics Junkies : The “p” in p-value stands for probability, which is what we are actually trying to find during a hypothesis test. What is that probability exactly? It is the probability that we see the value we observe in our data, given that the null hypothesis is true. Put more simply, we have found some observed value in our data set, so what is the probability that we see that observation if the null hypothesis were true. If the p-value is small, that means that the probability of seeing that observation is low if the null hypothesis were actually true. If the p-value is large, that means that the probability of seeing that observation is high if the null hypothesis were actually true. When we say we are doing a test under the assumptions of the null hypothesis, all we are saying is that the test is conducted under the assumption that the null hypothesis is true, which really only affects how we interpret our p-value.

If it is unlikely that we see an observation assuming the null hypothesis to be true, we can conclude that the null hypothesis is false, and there is sufficient evidence in the data to reject it. However, if it is likely that we do see an observation assuming the null is true, we would fail to reject the null hypothesis. Notice how we said fail to reject and not simply accept it. When we fail to reject the null hypothesis, we aren’t saying the null hypothesis is true; we are only saying that there isn’t enough evidence in the data to support the alternative hypothesis claim.

Since we are finding a probability, we know we are also finding an area under the curve. The figure below provides a visual of the area we are trying to find for our example hypothesis test with housing price.

To conduct the hypothesis testing on housing grade, the data was first sectioned off so that group one contained houses with grades 7 through 13, and group two contained houses with grades 1 through 6. Note that on the grading scale, a grade of 1 is the lowest and a grade of 13 is the highest. Before conducting any portion of the hypothesis test, it is important to first establish a null and alternative hypothesis:

• Null Hypothesis: There is no statistically significant difference between the price of homes with a grade between 7 and 13 and the price of homes with a grade between 1 and 6.
• Alternative Hypothesis: The price of homes with a grade between 7 and 13 is statistically significantly greater than the price of homes with a grade between 1 and 6.

Statistically, this is how to express the null and alternative hypothesis:

• Null Hypothesis: H0: μ1 = μ2, where μ1 is the mean price of homes with grades 7-13 and μ2 is the mean price of homes with grades 1-6
• Alternative Hypothesis: Ha: μ1 > μ2

When translating statistics language, the letters used to represent variables actually reveal a lot of information. When a Greek letter is used, it represents a population parameter. When a Latin letter is used (the alphabet used for modern English), it represents a statistic from the sample. For example, “σ” represents the population standard deviation whereas “s” represents the sample standard deviation.

Here is a brief overview of the process of hypothesis testing. Following this portion, this process is applied to the housing data to investigate if houses graded 7 through 13 are priced higher than houses graded 1 through 6.

1. Find the test statistic with the following formula: t* = (x̄1 – x̄2) / √(s1²/n1 + s2²/n2)
1. x̄1 = the mean price of houses with grades 7-13
2. x̄2 = the mean price of houses with grades 1-6
3. s1 = the standard deviation of the price of houses with grades 7-13
4. s2 = the standard deviation of the price of the houses with grades 1-6
5. n1, n2 = the sample size of each group respectively
6. NOTE: In Excel, =STDEV.S must be used when finding s1 and s2 since the standard deviation is being calculated from a sample, not a population
2. Calculate the p-value with the following formula
1. p-value = 1 - T.DIST(t*, df, 1)
1. The T.DIST function takes a test statistic and finds the corresponding area under the curve, or in other words the p-value
2. df = degrees of freedom = n - 2, where n is the sample size for the entire data set
3. NOTE: You must do 1 - T.DIST(t*, df, 1) because Excel functions take the area to the LEFT. Since the hypothesis test is one-sided in the right direction, you take the area to the left and subtract it from 1, leaving the remaining area to the right. Note that if the hypothesis test were two-sided, you would instead double the tail area, 2 * (1 - T.DIST(ABS(t*), df, 1)), since you need to find two areas under the curve.

Here is the process applied to the example data set:

1. Test Statistic
1. t* = (541096.73 – 531672.31) / 8102.85
2. t* = 1.1631
2. P-Value
1. df = 21613 – 2 = 21611
2. p-value = 1-T.DIST(1.1631, 21611, 1)
3. p-value = 0.12
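For those following along in Python rather than Excel, the same two steps can be sketched with the standard library alone. At df = 21611 the t distribution is essentially the standard normal, so `NormalDist` is used here as a close stand-in for `T.DIST`; the means and standard error are the values from the worked example above.

```python
from statistics import NormalDist

# Sketch of the test statistic and one-sided p-value from the worked example.
# With df = 21611 the t distribution is essentially the standard normal, so
# NormalDist() is a very close stand-in for Excel's T.DIST.
mean1, mean2 = 541096.73, 531672.31   # group means from the worked example
se = 8102.85                          # standard error from the worked example

t_star = (mean1 - mean2) / se
p_value = 1 - NormalDist().cdf(t_star)  # right-tail area

print(round(t_star, 4))   # 1.1631
print(round(p_value, 2))  # 0.12
```

With the same inputs, this reproduces the Excel results: t* ≈ 1.1631 and a p-value of about 0.12.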

When following this procedure, the p-value works out to be 0.12. Since this is greater than the standard cut off of 0.05, we would fail to reject the null hypothesis. This does not mean that the alternative hypothesis is to be rejected; it means that there is simply not enough evidence to support the claim that houses with grades 7-13 are priced higher than houses with grades 1-6.

However, it should be noted that this does not express the magnitude of the difference, or lack of difference, in the prices. This is where practical significance comes in. In order to see the actual magnitude of the difference in prices, a confidence interval will be calculated. Essentially, this shows us how different these prices actually are, and if they are subjectively different enough to be noteworthy.

Explanation of Confidence Intervals

Confidence intervals are ranges in which the true population parameter is likely to lie. These complement hypothesis tests perfectly; the hypothesis test indicates whether or not there is statistical significance, and then the confidence interval gives the magnitude of that significance. For example, if there is a statistically significant difference between the price of pencils sold at Walmart versus Target, calculating a confidence interval gives a range between which the true difference in prices lies. If it is found that the difference in prices is somewhere between \$0.10 and \$0.25, it can be said that the findings are not practically significant. However, if it is found that the difference in prices is somewhere between \$2.00 and \$3.00, it can be argued that that demonstrates practical significance and thus relevance for intended audiences such as consumers.

Confidence intervals are conducted at various levels of confidence, indicating the probability that the true population parameter lies within the range. To be more technical, the confidence level says that we have x% confidence that the true population parameter lies within this range. The standard confidence level is 95%, which will be used for this example. When the confidence level is increased, the confidence interval will become wider. When the confidence level is decreased, the confidence interval will become narrower. This will be demonstrated later with the example data with reasoning as to why this logically makes sense.

Here is a visual representation of the task at hand where the red circle represents the multiplier value to first be found:

Here is a brief outline of how to find the confidence interval:

1. Find the multiplier, t*
1. This is NOT the same t* that was used during hypothesis testing; it is a different multiplier that is found separately
2. The opposite of the T.DIST function is used: T.INV(probability, df). This takes a probability (an area) and degrees of freedom, and calculates the corresponding t* value. Referring back up to the figure above, since the t* value to be found includes ALL of the area to the left, the remaining area under the curve needs to be calculated and then added to 0.95. The total area under the curve must add up to 1, so the remaining two tail sections must together equal 0.05; due to the symmetrical nature of the bell curve, the two areas on either tail are equal, meaning they are each 0.025. The 0.025 is added to 0.95 to get a total area of 0.975; this is the area used in the function to find t*.
2. Find the margin of error
1. Margin of error = (t*) * standard error
1. Where standard error = √(s1²/n1 + s2²/n2), the same standard error used for the test statistic
3. Add and subtract the margin of error to the difference in sample means

Here the process worked out with the example housing data:

1. T-Multiplier
1. t* = T.INV(0.975, 21611)
2. t* = 1.960074
2. Margin of Error
1. Standard error = √(s1²/n1 + s2²/n2) = 8102.85
2. Margin of error = 1.960074 * 8102.85 = 15882.17
3. Lower Bound = (541096.73 – 531672.31) – 15882.17 =  -\$6,457.75
4. Upper Bound =(541096.73 – 531672.31) + 15882.17 = \$25,306.59
5. Confidence Interval = (-\$6,457.75, \$25,306.59)
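The same interval can be sketched in Python. At df = 21611 the t multiplier (1.960074 in Excel) is nearly identical to the normal multiplier, so the bounds below match the Excel results to within about a dollar; the means and standard error are taken from the worked example.

```python
from statistics import NormalDist

# Sketch of the 95% confidence interval from the worked example, using the
# normal approximation to the t multiplier (near-exact at df = 21611).
diff = 541096.73 - 531672.31          # difference in sample means
se = 8102.85                          # standard error from the worked example

multiplier = NormalDist().inv_cdf(0.975)   # ~1.96
margin = multiplier * se
lower, upper = diff - margin, diff + margin
print(f"({lower:,.2f}, {upper:,.2f})")     # roughly (-6,457, 25,306)
```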

When going through this process with the data set, it indicates that the difference between the two groups of homes is somewhere between -\$6,457.75 and \$25,306.59, where a negative number indicates that group two is actually priced higher than group one. Practical significance is much more subjective than statistical significance, so this price difference may not matter to some people, but it may to others. It should also be noted that the results of the confidence interval actually bolster the results found in the hypothesis test. In the hypothesis test, it was concluded that there is not enough evidence to support the claim that houses with grades 7-13 are priced higher than houses with grades 1-6. Looking at the CI, notice that the interval contains zero, meaning that there is a chance that the two groups of homes are in fact priced the same. It also includes negative numbers, indicating that it is possible that houses with grades 1-6 are even priced higher than houses with grades 7-13. This bolsters the findings from the hypothesis test, indicating once again that we should fail to reject the null hypothesis. However, if the CI did not contain zero, that would suggest the prices are unlikely to be the same, and therefore the null hypothesis could be rejected. This is another way that confidence intervals complement and bolster the findings in hypothesis testing.

Following the same procedures outlined above, the CI was recalculated by decreasing the confidence level to 90% and then increased to 99%:

| Confidence Level | Lower Bound | Upper Bound |
| --- | --- | --- |
| 90% | -\$3,904.14 | \$22,752.99 |
| 95% | -\$6,457.75 | \$25,306.59 |
| 99% | -\$11,448.97 | \$30,297.81 |

When the confidence level was increased, the CI widened. Intuitively, this makes sense since we can be more confident that the population parameter lies somewhere within this range since it contains so many possible values. The root of this is because more possible error is being accounted for as a result of a larger margin of error.
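The widening effect can be sketched in Python by recomputing the margin of error at each confidence level (again using the normal approximation to the t multiplier, which is near-exact at this df):

```python
from statistics import NormalDist

# Sketch: the margin of error grows as the confidence level grows, which is
# exactly why the interval widens. Standard error from the worked example.
se = 8102.85

margins = []
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-tailed multiplier
    margins.append(z * se)
    print(f"{level:.0%}: margin of error = {z * se:,.2f}")
```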

To conclude, it has been identified that there is not a statistically significant difference between the price of homes with grades 7-13 and the price of homes with grades 1-6. Therefore, we fail to reject the null hypothesis, as there is insufficient evidence from the data to support the claim that higher-graded houses are priced higher. We have 95% confidence that the difference in price between the two housing groups is between -\$6,457.75 and \$25,306.59.

This process is easily repeatable, especially with tools like Excel, and therefore can be used to investigate numerous variables and characteristics in the data. Here is a comprehensive table of the results from the hypothesis testing with the housing data set:

| Variable | Significance | Confidence Interval |
| --- | --- | --- |
| Waterfront | Statistically significant | (958246.81, 1302378.04) |
| Renovation* | Statistically significant | (190329.86, 269706.56) |
| Renovation Year** | Statistically significant | (35718.74, 192890.85) |
| Condition*** | Not statistically significant | (-5383.70, 15105.12) |
| Year Built**** | Not statistically significant | (-21513.88, 2102.56) |

* = We looked at whether houses that have been renovated were priced higher than houses that have not been renovated

**= We looked at whether or not houses renovated in or after 2000 were priced higher than those renovated before 2000

*** = We looked at whether houses with condition grades 4 or 5 were priced higher than those with grades 1, 2, or 3.

**** = We looked at whether houses built in or after 2000 were priced higher than those built before 2000.

Thanks for tuning into our discussion on hypothesis testing! We hope you rejoin us back next time when we dive into linear regression modeling.


Welcome back! We’re glad you made it to Part 2 of this series. Let’s dive right in.

One of the most important statistical concepts is the distribution curve. A distribution curve is a graph that shows the different frequencies of a certain variable. For example, if I were to make a distribution curve of scores received on a test, the graph would show how many people received each score. The area under a region of the distribution curve is the probability of seeing an observation fall in that region. Going back to our previous example, if I wanted to find the probability of a student receiving a score between 80 and 90, I would find the area under the curve in-between those two test scores. The total area under the curve is equal to 1. These concepts will be important later when discussing more about the p-value in hypothesis testing.
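As a quick numerical sketch of this idea, suppose (hypothetically) that test scores follow a normal distribution with mean 75 and standard deviation 10; the probability of a score between 80 and 90 is then the area under the curve between those two points:

```python
from statistics import NormalDist

# Hypothetical test-score distribution (made-up mean and standard deviation)
# illustrating "probability = area under the curve".
scores = NormalDist(mu=75, sigma=10)

p = scores.cdf(90) - scores.cdf(80)  # area between 80 and 90
print(round(p, 3))                   # 0.242
```

So under this made-up distribution, about 24% of students would score between 80 and 90.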

In later posts we will frequently be referring to a specific type of distribution called a normal distribution, often referred to as a bell curve. It is referred to as a “bell curve” because it simply looks like a bell! A normal distribution curve is special because it is perfectly symmetrical.

Sample size actually plays a large role in distribution, especially the normal distribution. First, let’s give sample size a brief definition: the number of data points or subjects in your sample, typically denoted as “n.” Another way to think about it is simply how much data you have. For example, if you conducted a survey of 50 students in a high school, your sample size would be 50. The larger the sample size, the more closely your sample’s distribution resembles the population’s distribution; relatedly, the Central Limit Theorem posits that the distribution of sample means becomes more normal as the sample size grows. Since many variables are naturally distributed normally, the larger your sample, the better it is at approximating population parameters. This is easier to conceptualize if you can actually see it for yourself. Below we have put together a demonstration that shows how increasing the sample size causes the sample’s distribution curve to better resemble the population’s.

For this demonstration, let’s use our example data set and look at the living room size. Let’s assume that the true population size is 21,613. We began by randomly drawing a sample of x number of living room sizes from the population and then graphed the sampled data in the form of a histogram to show the distribution. Note that the x axis represents the living room size in square feet, and the y axis represents the frequency in which it occurs.

| Sample Size | Distribution Curve |
| --- | --- |
| 100 | With 100 randomly selected data points, our distribution curve has a wide spread and does not resemble a normal distribution curve at all. This indicates that a sample size of 100 is inadequate. |
| 500 | With 500 randomly selected data points, the distribution curve has decreased spread and begins to look more normal than the sample of 100. This is a better candidate than the sample of 100, but we can still get a better model by using more data points. |
| 1,000 | With 1,000 randomly selected data points, the spread decreases once again and the curve becomes approximately more normal. This is a better candidate than a sample of 500, but it can become more normal with more data points. |
| 10,000 | With 10,000 randomly selected data points, this is where we see a normal distribution curve truly begin to emerge. Since the spread is decreasing, we know that the number of outliers is also decreasing. |
| 20,000 | With 20,000 randomly selected data points, the curve looks about the same as the sample of 10,000, except slightly more normal and with slightly less spread. This makes it a better candidate than 10,000 points, but likely only marginally. |
| 21,613 (population) | Here is the distribution curve of the “population” of 21,613 data points. Naturally, this has the most minimal spread and the most normal distribution. The sample of 20,000 points most closely resembles this curve. |

As we can see, as our sample size increased, the distribution curve resembled our population curve more accurately. Once we gathered a sample size of 10,000, an approximately normal distribution curve emerged, and it improved as we increased the sample size to 20,000. This shows why it is so important to have a large enough sample size if we want to draw valid conclusions about the population from our sample. Something to note, however, is that the sample of 10,000 data points produces a curve that looks very similar to the 20,000 and 21,613 curves; although the 20,000 technically best represents the population, we can say that we will have very similar, if not identical, results when cutting that sample size in half to 10,000 since their respective curves are mostly the same. This helps to save both time and resources since you will have to collect significantly less data.
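The demonstration above can be sketched in Python with a synthetic stand-in population; the square-footage parameters below are made up for illustration, but the pattern of sample statistics settling toward the population values as n grows is the same:

```python
import random
from statistics import mean

# Sketch of the sampling demonstration with a synthetic "population" of
# 21,613 living-room sizes (made-up mean and spread, for illustration only).
random.seed(0)
population = [random.gauss(2080, 900) for _ in range(21613)]
pop_mean = mean(population)

# Draw increasingly large random samples and watch the sample mean settle
# toward the population mean.
for n in (100, 500, 1000, 10000, 20000):
    sample = random.sample(population, n)
    print(n, round(mean(sample)))
```

Plotting each sample as a histogram (as in the table above) would show the same effect graphically: the bigger samples trace out the population's curve.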

More broadly, it also shows us why we do not necessarily need to use the full population if we want to accurately answer our questions. Why would I spend all of the time and money collecting 21,613 data points when I could reach the same conclusions with only, say, 10,000 data points? This is exactly why we use samples and not populations for statistical exploration. As long as it is large enough, a sample will accurately represent our target population.

But how do we know exactly what our sample size should be, and what is deemed large enough? Obviously we want as large of a sample size as possible, but due to finite time and resources, it may be necessary to simply find the minimum number of subjects required for valid conclusions about the population. Finding this value depends heavily on how accurately you want your results to reflect the population, which will largely be dictated by your margin of error. The smaller your margin of error, the larger your sample size will have to be. After setting a margin of error, you can use the following formulas to calculate the necessary sample size based on the type of method in use:

• If you are looking at the percentage / proportion of a population that contains a certain characteristic: n = (z*)² * p * q / (margin of error)²
• p = the estimated percent of the population that contains the attribute in question
• q = 1 - p = the estimated percent of the population that does not contain that attribute
• z* = z multiplier
• ex: 95% confidence level → z* = NORM.S.INV(0.975) = 1.96 (don’t worry, we go through an example of how to do this step later in the section)
• If you are looking at a regression model, this one is a little bit more flexible and subjective
• Green (1991) → n >= 104 + k
• k = number of predictors
• Rule of thumb → 10 subjects for every one predictor
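A minimal sketch of these two regression rules of thumb in Python:

```python
# Sketch of the two regression sample-size rules of thumb above.
def green_1991(k):
    """Green (1991): n >= 104 + k, where k is the number of predictors."""
    return 104 + k

def ten_per_predictor(k):
    """Rule of thumb: 10 subjects for every one predictor."""
    return 10 * k

# e.g. a model with 3 predictors:
print(green_1991(3), ten_per_predictor(3))  # 107 30
```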

Note that if the calculated sample size has decimals, it is standard to round up to the nearest whole number, regardless of the value of the decimal. For example, a calculated sample size of 145.129 would be rounded up to 146, even though traditional rounding rules would indicate not to. Also note that margin of error must be expressed in decimal form when used in the formula. (ex: 5% → 0.05)

These formulas might seem like a foreign language if you’re non-technical, but don’t despair! They’re actually much simpler than they look. You have all of the information; you just need to plug it in and voila! All of the variables that you are plugging in are either numbers you already have or numbers you are determining yourself. If you are still a little bit confused, don’t worry! Before going on to the next section, we will run through an example for you. Let’s set the scene: you are trying to find out what percentage of houses in King County, WA have been renovated, but you are unsure how many houses you need to collect data on for accurate results. We will begin by finding each value we need, and then put it all together by plugging the numbers into the formula and finding our sample size. Let’s begin!

• p → we can estimate that approximately 50% (0.5) of the houses have been renovated; this is just an estimate, and does not need to be accurate
• q → 1-p → 1-0.5 = 0.5; we can estimate that approximately 50% (0.5) of the houses have not been renovated; once again, just an estimate, so don’t get too caught up in figuring out this value
• z* → let’s say we decided we want 95% confidence in our results; we would find our z* multiplier by calculating =NORM.S.INV(0.975) in Excel, which gives us 1.96; 95% is a pretty standard confidence level to use
• Refer to the Excel Guide for an explanation of why we input 0.975
• Margin of error → let’s say we decide we want our results to be within 3 percentage points of the true population parameters, then our margin of error would be 3%, but would be put in as 0.03 in the formula.
• NOTE: When we set our confidence level and margin of error, we are saying: “We want to have x% confidence that our results are within y percentage points of the true population parameter where x represents the confidence level and y represents the margin of error.”
• n = 1067.11 → round up → 1068 housing points
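Here is the same calculation sketched in Python. The post plugs in z* = 1.96 exactly and gets 1067.11; the full-precision multiplier gives roughly 1067.07, which still rounds up to 1068:

```python
from math import ceil
from statistics import NormalDist

# Sketch of the renovation sample-size example: n = (z*)^2 * p * q / moe^2.
p = 0.5                              # estimated proportion renovated
q = 1 - p                            # estimated proportion not renovated
z = NormalDist().inv_cdf(0.975)      # ~1.96 for 95% confidence
moe = 0.03                           # 3% margin of error, in decimal form

n = z ** 2 * p * q / moe ** 2
print(round(n, 2), ceil(n))          # ~1067.07, rounds up to 1068
```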


Welcome to data science 101, your guide to important introductory topics in predictive analytics. If you are an analytics specialist looking to learn more about data science, or even just someone interested in learning more about the topic, then this is the series for you!

We want to show you that even if you do not have a strong background in statistics, or do not consider yourself a data scientist, you can still effectively work with and analyze data. Throughout this four-part series, we will go through the following topics:

Throughout this series we will use a real data set to help explain and demonstrate the discussed concepts. Should you choose to follow along with us, we have all of our work available to you in the following links:

• For Python gurus, click HERE to download a zip file that contains the Jupyter Notebook and necessary csv files. When in Jupyter Notebook, only open the file “Blog Post Data Set and Tests (Code).ipynb” to access the notebook with the code. The other files within the folder are just csv files of the data to be read in the notebook.

All of the calculations were done in both Excel and Jupyter Notebook, but all of our explanations are done using Excel. The Jupyter Notebook file will show you how we coded the same tests as well as the graphics and figures using Python.

Before jumping into any tests or analysis, we first go through how to approach your raw data set. We have a lot to get through, so let’s get started!

The Basics: How to Look at Your Data

Before we even begin to analyze our data, it is important to first assess our raw data. Ensuring you have an in depth understanding of your data allows you to better manipulate it and draw better conclusions. When collecting or selecting data, it is important to keep two overarching questions in mind:

1. Who is my target audience?
2. What question am I trying to answer?

When thinking of your target audience, you want to make sure that your data is applicable and appropriate for that population. For example, if you were exploring housing prices for an intended US audience, it would be more beneficial to find data for houses in the United States rather than in Europe or Canada. When thinking of your question, you must ensure that your data is actually viable for answering it. Think about the type of data you are looking for, whether it be qualitative or quantitative. For example, if you were looking at tumor sizes and dimensions, quantitative data would be more advantageous to use, but if you were looking at tumor attributes, such as shape and texture, qualitative data would be better to look at. In addition, it is important to gather any and all information that you may have about the data set. For example:

1. Where was the data collected from?
2. Who collected the data?
3. When was the data collected?
4. How large is the data set?

Some of these questions may be more relevant than others, especially depending on whether your data is primary, meaning you collected it yourself firsthand, or secondary, meaning it was collected by someone else and you are using it secondhand. The goal of data analytics is to answer a question, and that begins with information outside of just the numbers.

Throughout this series, we will apply these thought processes and methods to a real, secondary data set pertaining to housing prices. This data set contains a sample of 21,613 houses describing their price, number of bedrooms, number of bathrooms, number of floors, grade, condition, size of the living room, size of lot, square footage of the house above ground level, size of the basement, the year it was renovated, and the year it was built. The data was collected from King County in Washington in 2015.

We are looking to answer the question: what characteristics of a home affect its price? This is an extremely overarching question that will require multiple methodologies to reap the answers we want. In the next section, we will briefly discuss key statistical terms and methodologies so you can better understand the content that follows.

Key Statistical Terms and Definitions

Before we dive in, let’s talk definitions. Just like any other field of study, statistics has its own language, or jargon. If you aren’t a native statistics speaker, understanding this language may seem a little bit confusing, but hang with us! Throughout the series we will ensure that you can decode some of the tricky statistics language that we use, starting with this brief section where we define some critical terms and topics that you will hear frequently throughout. If you ever get confused in later sections, just pop back up here and read through our explanations. Here is a list of some of the important terms we will be using throughout this series:

• Primary Data → Data that has been collected by you, firsthand
• Ex: Surveys, interviews, experiments, focus groups, etc.
•  Secondary Data → Data that has been collected or produced by someone else and used secondhand by the researcher
• Ex: Information libraries, public government data, population census, etc.
• Statistical Significance → In essence, this just means that something is of importance and worth investigating. If something is statistically significant, to be a bit more technical, that means that the relationship between two variables is likely not due to chance alone and can be attributed to something else.
• Practical Significance → Tells us how applicable or relevant our findings actually are; shows the magnitude of our statistical significance
• P-value → p-value, or probability value, is possibly one of the most important statistical terms to be discussed. This metric places a numerical value on statistical significance. There are different p-value cut-offs that can be used, but it is standard to say that a p-value less than or equal to 0.05 indicates statistical significance (we will be using this cutoff throughout the remainder of the series).
• Discrete Variable → A variable that can only take on a countable number of distinct values
•  Ex: Number of rooms, number of floors, number of pets
• Continuous Variable → A variable that can take on an infinite number of values
•  Ex: Height, square footage of a house, weight
• Hypothesis Testing → Just like a science experiment, in hypothesis testing, we are simply trying to test a hypothesis or a “prediction” that we may have. In hypothesis testing, we will form two types of hypotheses:
• Null Hypothesis → Put simply, this is our hypothesis, or statement, that there is nothing going on between our two variables.
•  Ex: There is no statistically significant difference between the price of pencils at Target and the price of pencils at Walmart
• Alternative Hypothesis → This is the claim that we are trying to prove with the hypothesis test; once again, put simply, this means that there is, in fact, something going on between our two variables worth noting
•  Ex: There is a statistically significant difference between the price of pencils at Target and the price of pencils at Walmart
•  Note that all hypothesis testing is done under the assumption that the null hypothesis is true
• Dependent Variable → Also known as the “y” variable
• Independent Variable → Also known as the “x” variable; the variable we think has some effect on the dependent variable
• Linear Regression → This is a commonly used methodology to see if one variable can be used to predict another. There are two types of modeling methods:
• Single Linear Regression → Seeing if one independent variable, or predictor, is good at predicting a dependent variable
•  Ex: Is an SAT score a good predictor of GPA?
• Multiple Linear Regression → Seeing if multiple independent variables / predictors have a relationship with a dependent variable
•  Ex: Are AP, SAT, and ACT scores good predictors of GPA?
• Correlation → The relationship between variables, typically denoted as r; r ranges between -1 and 1, with -1 being perfectly negatively correlated and 1 being perfectly positively correlated. The closer |r| is to 1, the stronger the association. If your r value is 0, there is no linear correlation between the two variables, so the closer r is to 0, the weaker the association.

If you’re ever confused about certain terminology used throughout, you can jump back up here for a quick refresher!

Can’t get enough? Well neither can we! In our next installment of this series, we will dive into the importance of the distribution curve and sample size, two concepts that are imperative for setting the stage for most of our consequent statistical testing. Click here to read.
