This step-by-step guide will teach you how to set up a Logi Analytics multi-tenant environment with separate databases for each tenant! First, it’s important to explain the difference between a single-tenant and a multi-tenant environment. In a single-tenant environment, each customer has their own independent database and instance of the application. There is no sharing in this setup. Single-tenant environments are ideal for clients that deploy their Logi Analytics application on-premises or that have customer-specific reports.

In many situations, it can be beneficial to have a multi-tenant setup. With multi-tenancy, a single application and its supporting infrastructure serve multiple customers. Each customer shares the application but still has their own database. Each tenant’s data is isolated and remains invisible to other tenants. Multi-tenant environments result in lower costs due to economies of scale and shared infrastructure. They are easier to update, and configuration can be done while leaving the underlying codebase unchanged. Multi-tenant environments are also better suited to cloud and SaaS deployment.

Regardless of why you’re using a multi-tenant environment, let’s get started with the setup! There are two approaches: you can create a Logi Analytics Process to set your session variables and update the connection string, or use a Plugin to create a new connection string.

Option 1: Creating a Logi Analytics Process

StartUpProcess runs a Task in a Process definition. When StartUpProcess is in the Settings definition, it runs with each user’s session start, or optionally with just the first session. When StartUpProcess is in a Report definition, it runs at the beginning of the report generation.

In the _Settings file you need to add two DB Connections:

  1. The first DB Connection will be shared and will contain the information for all the tenant-specific database connections.
  2. The second DB Connection will be dynamic and will use the tenant details from the first DB to set the connection.


You also need to create a process in your Logi Analytics application. After you’ve created the process, you need to add the StartUpProcess element to your Settings file. Then, point it to the Logi Analytics process that you just created.

In the process, you need to create the task that will be responsible for setting up the connection string dynamically. In our example, we added a SQL DataLayer that is using the connection to a shared data source that is defined in the Settings file. We are using this source to get the result set for a specific tenant.

The DataLayer will return the connection-specific records based on the provided tenant. We are using the @tenant_id parameter inside the SQL command to control the result set. The @tenant_id parameter is specified in the SqlParameter Logi Analytics element. In the embedded model, the value for the parameter is set by the parent application, which sends the tenant ID to distinguish between tenants.

After the DataLayer is processed, the returned values can be referenced by using a @Data Logi Analytics token. Based on the result from the DataLayer, you need to set the @Session variables that will be used to set up the connection to the database. The variables that you need to set are Database, Server, User, and Password. Once the startup process is finished, the connection will use the session variables that were created in the process.

After you’ve completed this, your application should be able to set up the connection string dynamically using the startup process.
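As a rough illustration of what the startup process does, here is a Python sketch (Logi itself uses XML definitions and a SQL DataLayer, not Python; the tenant table, column names, and values below are hypothetical): the task looks up the tenant’s row in the shared database and exposes the results as session variables that the dynamic connection then consumes.

```python
# Illustrative Python sketch of the startup task's logic -- Logi itself uses
# XML definitions, not Python. The shared "database" and its column names
# (Server, Database, User, Password) are hypothetical example values.
SHARED_TENANT_DB = {
    "42": {"Server": "db-east.example.com", "Database": "tenant42",
           "User": "app42", "Password": "s3cret"},
}

def startup_process(tenant_id):
    # Mirrors the SQL DataLayer: SELECT ... WHERE tenant_id = @tenant_id
    row = SHARED_TENANT_DB[tenant_id]
    # Mirrors the Set Session Variables step: Database, Server, User, Password
    return {f"@Session.{key}": value for key, value in row.items()}

def connection_string(session):
    # The dynamic connection in _Settings resolves these tokens at runtime.
    return ("Server={};Database={};Uid={};Pwd={};".format(
        session["@Session.Server"], session["@Session.Database"],
        session["@Session.User"], session["@Session.Password"]))
```

The key design point is the same as in Logi: the tenant lookup happens once per session, and the dynamic connection only ever reads the session variables, never the shared tenant table directly.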

Option 2: Plugin

A Plugin is a .NET library assembly (a DLL) or a Java .JAR file containing custom code that performs actions while each request is processed. Plugins can dynamically modify definitions and have full access to the web request, application, session, and response objects. A Plugin gives Logi Analytics developers the ability to programmatically extend the functionality of their applications. With Plugins, developers can dynamically change the behavior of Logi Analytics reports at runtime by accessing request and session variables and modifying report definitions.

To support dynamic connection strings, a Plugin Call element is used in the _Settings definition to retrieve the connection string (or other values) from a session variable. This then dynamically modifies the _Settings definition by inserting the connection string value. The plugin runs on the LoadDefinition event and therefore alters the definition before it is processed.

To use a Plugin, you simply add the Plugin Call element to your _Settings definition file. The Plugin is loaded as a library along with the application and runs under the application’s process on the web server. When the Plugin Call is present in the _Settings file, the Plugin executes its custom code while each request is processed.

To set the connection string dynamically by using the Plugin, you will need to add a Plugin Call element to the _Settings definition file with the following required attributes: Assembly Name, Class Type Name, Event, and Method. In the Event attribute, you must specify when to call your Plugin. A Plugin has the following Event options: FinishHtml, FinishData, and LoadDefinition.

To set up the database connection in _Settings dynamically, you will need to select the ‘LoadDefinition’ Event. With the ‘LoadDefinition’ Event, the Plugin is called when the definition file is loaded. The Plugin may change the definition XML, adding, removing, or changing elements. You’ll also need to specify the Plugin parameters, which are variables that you can pass into the Plugin Call element’s Method. In our Plugin Call, we pass the tenant ID and tenant name as Plugin parameters. Based on these variables, the Plugin sets the connection-specific session variables that we use in the connection.

The connection string is being modified dynamically by the Plugin that is called in the _Settings definition. The Plugin is setting the different connection strings based on the current tenant.
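Conceptually, the LoadDefinition handler works like the following Python sketch (real Logi Plugins are written in .NET or Java; the element ID, attribute name, and tenant map here are illustrative placeholders): it receives the _Settings XML, finds the dynamic connection element, and rewrites its connection string for the current tenant before the definition is processed.

```python
import xml.etree.ElementTree as ET

# Illustrative sketch only: Logi Plugins are .NET or Java assemblies. The
# element ID ("connTenant"), attribute name, and tenant map are placeholders.
TENANT_CONNECTIONS = {
    "acme": "Server=acme-db;Database=acme;Uid=app;Pwd=pw;",
}

def load_definition(settings_xml, session):
    """Mimics a LoadDefinition handler: rewrite the connection element."""
    root = ET.fromstring(settings_xml)
    conn = root.find(".//Connection[@ID='connTenant']")
    conn.set("ConnectionString", TENANT_CONNECTIONS[session["tenant_id"]])
    return ET.tostring(root, encoding="unicode")
```

Because the rewrite happens before the definition is processed, every downstream element sees the tenant-specific connection as if it had been hard-coded in _Settings.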


We hope that one of these two approaches worked for you! Please feel free to reach out or comment if you have any questions.


Designing, improving, and creating data warehouses is what we do day in and day out at dbSeer. If you are building a data warehouse or if you simply want to make improvements to an existing one, there are five core principles that you must keep in mind. These are the best practices we’ve learned along the way and still use today.


  1. Data Quality (the top priority!)

Data quality is essential: if you don’t manage the quality of the data, you will lose the trust of the data consumer. While it might not be possible to address every data quality issue, you should be able to quantify the accuracy of the data for the end user. To get quality right, you need a system in place that captures and logs data issues in transit. You also need a mechanism to capture data issues at rest (i.e., in the stored data), so you can make sure it’s not corrupted. Furthermore, there should be a framework that can automatically clean the data, or, alternatively, a process so that you can do it yourself. I’ll say it one more time: data quality and the ability to measure the cleanliness of the data is critical.
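To make “quantify the accuracy” concrete, here is a minimal Python sketch of an in-transit quality check, assuming a simple rule-per-check model (the rules and field names are invented for illustration):

```python
# A minimal sketch of quantifying data quality in transit: each rule is a
# predicate over a row; failures are captured for the issue log and an
# accuracy score is reported to the data consumer. Rules and field names
# are invented examples.
def quality_score(rows, rules):
    failures = []
    for i, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                failures.append((i, name))  # row index + violated rule
    bad_rows = {i for i, _ in failures}
    return (len(rows) - len(bad_rows)) / len(rows), failures

rules = {
    "amount_positive": lambda r: r["amount"] >= 0,
    "customer_present": lambda r: bool(r["customer_id"]),
}
```

The same rule set can be re-run against stored data to catch issues at rest, and the failure log feeds whatever cleaning framework or manual process you put in place.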


  2. Traceability

Everything within your framework should be traceable. In order to have a successful production-ready environment, a non-DBA, such as a system administrator, application administrator, or business analyst, needs to be able to identify potential issues in the data warehouse or ETL processes. There must be a mechanism in place that traces the data movement in each step (the data size, row counts, operations/manipulations applied to the data, transformations or enrichments, etc.). This information should then be available on an interface for both the DBA and the non-DBA, the people who aren’t experts. Over time, you will build a history that you can analyze to identify potential issues.
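A step-level trace can be as simple as the following Python sketch, which wraps each ETL operation and records rows in, rows out, and duration (the log record shape is an assumption for illustration, not a standard):

```python
import time

# Sketch of step-level ETL tracing: each step records the operation name,
# rows in/out, and duration, building the history a non-DBA can review on
# an interface. The log record shape is an illustrative assumption.
TRACE_LOG = []

def traced_step(name, operation, rows):
    start = time.time()
    result = operation(rows)
    TRACE_LOG.append({
        "step": name,
        "rows_in": len(rows),
        "rows_out": len(result),
        "seconds": round(time.time() - start, 3),
    })
    return result
```

For example, `traced_step("drop_nulls", lambda rs: [r for r in rs if r is not None], data)` leaves a record of exactly how many rows the step removed, which is the history you analyze later for anomalies.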


  3. Modularity

A data warehouse commonly brings in a variety of data sources and serves the user for many reasons. You need to recognize the context, function, and purpose of each specific data warehouse. While there might be cross-references between subject areas, the goal of the database design should be to isolate modules whenever possible. If you design it this way and carry the same separation through your ETL processes, the end user can turn individual modules on and off and configure the warehouse to meet their needs. This makes the design more robust.
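The on/off behavior can be sketched in Python as config-driven module selection (the module names and config shape are hypothetical):

```python
# Sketch of config-driven modularity: each subject area is an independent
# ETL module that the end user can switch on or off. Module names and the
# config shape are hypothetical examples.
MODULES = {
    "sales": lambda: "loaded sales",
    "inventory": lambda: "loaded inventory",
    "hr": lambda: "loaded hr",
}

def run_etl(config):
    # Only enabled modules run; disabled ones are skipped entirely.
    return [load() for name, load in MODULES.items() if config.get(name)]
```

Because each module is self-contained, disabling one never breaks the others, which is exactly the robustness the design principle is after.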


  4. Purpose Built for Analytics Consumption

One of the mistakes that we see often in our work is that people give technology a higher priority than analytics needs. This is a mistake: you shouldn’t let the technology drive the requirements; it should always be analytics over engineering. The priority is the end user and the analytics you’re building for that user. When you focus on engineering, the design becomes the main objective; analytics should always be the first priority. Reporting analytics should dictate the structure, aggregation, and granularity of the data.


  5. Adaptability

It’s common for a data warehouse to evolve as requirements change, due to expansion or new business needs. The data warehouse design should lend itself to that evolution. To support this, ensure that the modularity allows you to add and remove different parts. It’s also a good idea to build multi-level granularity into your dataset and data warehouse. The warehouse should be built to be flexible and allow for changes in the future.


You might be wondering if your data warehouse was designed with these principles in mind. Or, you’re building a new data warehouse and want to start off using these best practices. dbSeer is able to conduct an audit of your data warehouse against these five core principles. We will evaluate your data warehouse against our 70-question checklist and provide you with a report showing you where the gaps are and how you can make improvements. Reach out if this sounds like something you’d be interested in!


MySQL is one of the most popular open source relational databases in the world. RDS makes it easier to set up and deploy MySQL databases in the cloud and simplifies operations. RDS lets DBAs spend more time on developing their applications instead of traditional DBA tasks like replication, performance monitoring, storage scaling, backups, restorations, and software patching.

RDS now supports the community edition of MySQL versions 5.5, 5.6, 5.7, and 8.0. Applications, code, scripts, and tools that work with MySQL work with RDS for MySQL too. So, if you are planning to move your databases to the cloud, RDS makes it easier for you. RDS is cost efficient and has flexible hardware capacity. You can deploy on-premises databases to the AWS cloud in minutes.

Below are the new features that were added to RDS for MySQL in 2019.


Amazon RDS Performance Insights supports viewing more SQL text

  • Amazon RDS for MySQL Performance Insights now supports viewing more SQL text: up to 1,024 bytes of each row can be displayed in the Top Load Items table for every SQL statement. In previous versions, the maximum was 500 bytes.

Performance Insights counters

  • You can now add performance counters to Performance Insights charts for your database instances.
  • Performance Insights is a feature that monitors database performance, making it easy for a developer to troubleshoot and resolve performance issues.
  • RDS for MySQL now supports Counter Metrics, a feature that allows you to add up to 10 extra graphs that visualize OS and DB performance metrics. Counter Metrics provide extra data that corresponds with the database load chart, which helps to identify and analyze DB performance.

Supports db.r5 DB instance classes

  • The R5 instance type is now supported by Amazon RDS for MySQL.
  • r5 is a memory-optimized DB instance class with improved Amazon EBS performance and networking, built on the AWS Nitro System, which combines a lightweight hypervisor with dedicated hardware.

Supports db.t3 DB instance classes

  • t3 is the latest generation burstable-performance database instance class. It offers more computing capacity than db.t2 and can sustain full CPU utilization when needed.

Support for minor versions of MySQL

  • Minor versions 5.7.24 and 5.7.25 of the 5.7 major version, and 8.0.15 of the 8.0 major version, are now supported by Amazon RDS.
  • These versions provide improvements in functionality.

Support for restoring 5.7 backups from Amazon S3

  • This new feature lets you create backups of MySQL 5.7 databases.
  • Backups can be stored in an Amazon S3 bucket, and you can restore the backup file into a new Amazon RDS database instance running MySQL.

Support for the MySQL password validation plugin

  • For improved security, Amazon RDS provides the password validation plugin “validate_password” for MySQL.
  • This feature is available for versions 5.6, 5.7, and 8.0 of RDS for MySQL.
  • The plugin enforces password requirements, such as minimum length and required character classes.
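For illustration only, the kind of checks validate_password’s MEDIUM policy enforces (length of at least 8, plus numeric, lowercase, uppercase, and special characters) can be sketched in Python; the real enforcement happens inside the MySQL server, not in client code:

```python
import re

# Illustration of checks in the spirit of validate_password's MEDIUM policy:
# length >= 8, plus numeric, lowercase, uppercase, and special characters.
# The actual validation runs inside the MySQL server.
def meets_medium_policy(password):
    return all([
        len(password) >= 8,
        re.search(r"[0-9]", password),
        re.search(r"[a-z]", password),
        re.search(r"[A-Z]", password),
        re.search(r"[^0-9A-Za-z]", password),
    ])
```

A client-side pre-check like this can give users immediate feedback, but the server’s plugin remains the source of truth for what passwords it accepts.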

Supports Upgrade Prechecks

  • When upgrading from a lower version of MySQL (e.g., 5.7) to a higher version (e.g., 8.0), RDS prechecks for incompatibilities. If any incompatibilities are found, the upgrade stops immediately.
  • RDS generates a “PrePatchCompatibility” log file that holds the details about the incompatibilities, which helps you review them and take action to resolve them.

Support for 64 TiB and 80k IOPS of storage

  • Amazon RDS now lets you create DB instances with up to 80,000 provisioned IOPS and storage of up to 64 TiB.
  • This increase enables you to consolidate database shards into a single database instance, making database management easier, with zero downtime.

Support for Storage Auto Scaling

  • About a month ago, RDS for MySQL started supporting auto scaling for RDS storage.
  • What is Storage Auto Scaling? Storage capacity scales automatically in response to growing database workloads, with no downtime.
  • Previously, storage capacity had to be provisioned manually based on assumptions about application demand. Provisioning less than the required storage capacity can lead to application downtime, while provisioning excess capacity leads to unused resources and unnecessary costs.
  • Auto scaling of storage avoids downtime and lets you pay only for the resources that you use.


If you’re using RDS for MySQL, let us know what you think of these updates!


SQL Server is one of the many database engines supported by Amazon RDS. RDS supports many versions of SQL Server, including 2008 R2, 2012, 2014, 2016, and 2017, as well as the Express edition. RDS makes it easy for DBAs to set up, operate, and deploy SQL Server databases in the cloud.

2019 didn’t just bring changes in your weight-loss goals, it also brought many updates for RDS for SQL Server. Thankfully, these changes are permanent, unlike your half-hearted attempts to make it to the gym. Keep reading to learn what’s new!



Amazon RDS Performance Insights are available

Amazon RDS now supports Performance Insights for SQL Server RDS. Performance Insights is an advanced database performance monitoring feature for diagnosing performance challenges on RDS. It helps you understand your database workloads and provides a visual dashboard for understanding performance problems.

Database Limit Increase

RDS for SQL Server has increased its limit of the number of databases per instance from 30 to 100. System databases like master, tempdb, etc. do not count towards this limit.

Multiple file restore and native restore improvements

Amazon RDS for SQL Server now supports restoring multi-file backups from S3. It also supports migrating backups of up to 16 TiB, up from the previous limit of 5 TiB.

Support for storage auto scaling

RDS for SQL Server now supports automatic scaling of storage capacity in response to increasing workloads, without any downtime. With auto scaling support, you simply set the maximum storage limit and the rest is taken care of. Auto scaling works on new and existing databases and takes just a few clicks to enable. There is no additional cost for auto scaling.

Always-On Availability Groups supported in Microsoft SQL Server 2017

To support fault-tolerant, highly available, and scalable workloads, SQL Server RDS now supports Always On Availability Groups. Always On automatically replicates databases between two Availability Zones to ensure high availability and durability of SQL Server databases. SQL Server RDS instances running SQL Server 2016 or 2017 Enterprise Edition with Multi-AZ automatically have a primary database in one Availability Zone and a standby replica in a different Availability Zone.

Added support for Microsoft SQL Server Audit

Amazon RDS now supports SQL Server Audit. SQL Server supports server-level audits for server-level events, and Enterprise editions also support database-level audits for database-level events. These audits can be uploaded to S3, and retention policies on the S3 bucket let you keep the audit logs for a defined period of time.
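A retention policy like the one described can be expressed as an S3 lifecycle configuration; the sketch below shows the dict shape that boto3’s `put_bucket_lifecycle_configuration` accepts (the bucket name, prefix, and 90-day period are example values, not requirements):

```python
# Sketch of an S3 lifecycle rule that expires uploaded audit logs after a
# retention period. The dict follows the shape boto3's
# put_bucket_lifecycle_configuration expects; bucket name, prefix, and the
# 90-day period are example values.
audit_retention = {
    "Rules": [{
        "ID": "expire-sqlserver-audit-logs",
        "Filter": {"Prefix": "audit-logs/"},
        "Status": "Enabled",
        "Expiration": {"Days": 90},
    }]
}

# Applying it would look like this (requires boto3 and AWS credentials):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-audit-bucket", LifecycleConfiguration=audit_retention)
```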


After reading that impressive list of updates, I’m sure you’re already on your way to exploring all that RDS for SQL Server has to offer! Enjoy, and, as always, feel free to reach out with any questions.


Amazon Aurora is a relational engine that is designed to be compatible with databases like MySQL and PostgreSQL. This allows you to make your existing application work in the cloud – but at up to 5x the standard performance of these databases! Amazon Aurora supports very large databases without compromising performance. It’s a reliable and cost-efficient option for growing businesses.

Below you can explore the plethora of updates that came with RDS for Amazon Aurora in 2019!


Amazon Aurora Updates

Sharing and Cross-Region Copying of Snapshots

Aurora Serverless DB cluster snapshots can now be shared publicly or with other AWS accounts, and they can be copied across multiple AWS regions.

Authorized AWS accounts can easily restore a DB cluster from a shared DB cluster snapshot without copying the snapshot.

This feature is also supported by Amazon Aurora with PostgreSQL.

Capacity of 1 Unit and a New Scaling Option

Instead of managing and provisioning DB instances, Aurora Serverless lets you set minimum (as low as 1 unit) and maximum ACU values for your DB cluster. An ACU is a capacity measure that describes both performance and memory capacity.

Availability to 14 AWS Regions

Aurora Global Database is now supported and available in 14 AWS regions – see below.

EU (Ireland), EU (London), US East (Ohio), Asia Pacific (Seoul), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Canada (Central), EU (Frankfurt), Asia Pacific (Singapore), EU (Paris), Asia Pacific (Tokyo), US West (Northern California), and Asia Pacific (Sydney).

Replication can be set up between any two of the AWS regions mentioned above.


Performance Insight Support

Performance Insights is a feature that monitors database performance by visualizing database load, which helps in spotting and troubleshooting performance challenges.

This feature is available in Amazon Aurora.

RDS Recommendations

The RDS Recommendations feature in Amazon Aurora provides automated best-practice guidance by evaluating your configuration, usage, and performance data. This includes information about configuration issues, such as mismatched instance classes or outdated engine versions.


Amazon Aurora with MySQL Updates

Performance Insights Counters

Performance Insights is an advanced performance monitoring tool for database workloads. Additional Counter Metrics are now supported for AWS MySQL RDS. Counter Metrics allows customizing the Performance Insights dashboard with 10 additional graphs.

Amazon CloudWatch

Amazon Aurora Serverless MySQL-compatible databases can now be set to issue log reports to Amazon CloudWatch containing general logs, slow query logs, audit logs, and error logs. This will help you keep track of your Aurora Serverless databases’ query performance, database activity, and errors.

Supports db.r5 and db.t3 DB Instance Classes

Amazon EC2 R5 instances are the next generation of memory-optimized instances for the Amazon Elastic Compute Cloud. R5 instances are well suited for memory-intensive applications such as high-performance databases, distributed web-scale in-memory caches, mid-size in-memory databases, real-time big data analytics, and other enterprise applications.

T3 instances are the next generation low cost burstable general-purpose instance type that provide a baseline level of CPU performance with the ability to burst CPU usage at any time for as long as required. T3 instances are designed for applications with moderate CPU usage that experience temporary spikes in use.

Amazon Aurora for MySQL now supports R5 and T3 instance classes.

GTID based replication

Amazon Aurora with MySQL 5.7 compatibility now supports GTID. GTID (Global Transaction Identifier) is a unique identifier that is assigned to each transaction of the MySQL database.

Faster Migration from MySQL 5.7 databases

MySQL 5.7 backups stored in S3 can now be restored in Amazon Aurora with MySQL compatibility. Both full and incremental backups of the database can be restored. This is much faster than the mysqldump utility.

Data API

The Data API is now available for Aurora with MySQL compatibility. Web service-based applications like AWS Lambda, AWS AppSync, and AWS Cloud9 can access the Aurora cluster. The Data API doesn’t require a persistent connection to the DB cluster; it provides a secure HTTP endpoint to run SQL statements.
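For a sense of what a Data API call looks like, here is a sketch of the request parameters for the rds-data `execute_statement` operation (the ARNs, database name, table, and SQL are placeholders for illustration):

```python
# Sketch of the request parameters for a Data API call via the rds-data
# service's execute_statement operation. The ARNs, database name, table,
# and SQL below are placeholders.
request = {
    "resourceArn": "arn:aws:rds:us-east-1:123456789012:cluster:my-aurora",
    "secretArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-creds",
    "database": "mydb",
    "sql": "SELECT id, name FROM customers WHERE id = :id",
    "parameters": [{"name": "id", "value": {"longValue": 42}}],
}

# With boto3 and credentials configured, this would be sent as:
# response = boto3.client("rds-data").execute_statement(**request)
```

Credentials come from Secrets Manager (the secretArn), so the caller never handles database passwords and never holds an open connection.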


Amazon Aurora with PostgreSQL Updates

Logical Replication

Amazon Aurora with PostgreSQL compatibility now supports logical replication. Using logical replication, you can replicate data changes from one Amazon Aurora database to other databases by using replication tools like AWS DMS or the native replication slots of PostgreSQL.

Activity Streams

Amazon Aurora with PostgreSQL now supports Activity Streams for real time monitoring. Database Activity Streams provide a real-time data stream of the database activity in your relational database.

Cluster cache management

Cluster cache management is now supported by Amazon Aurora with PostgreSQL compatibility. This feature provides faster recovery in case of failover. With cluster cache management, a specific reader DB instance in the cluster is designated as the failover target. Data in the designated failover target’s cache is synchronized with the read-write instance’s cache. In case of failover, the designated failover target is promoted to the read-write instance.

Data import from Amazon S3

Amazon Aurora with PostgreSQL compatibility allows you to import data from an Amazon S3 bucket into an Aurora PostgreSQL instance. A new extension, aws_s3, has been added to support the import of data. Any data format that is supported by the PostgreSQL COPY command is supported for import.

Copy tags from cluster to database snapshot

Tags on an Amazon Aurora with PostgreSQL compatibility database cluster can now be automatically copied to any cluster-created database snapshots. This makes it easy to set metadata on the snapshots, such as access policies, to match the parent cluster.

Restore an Encrypted Amazon Aurora PostgreSQL Database from an Unencrypted Snapshot

An encrypted Amazon Aurora with PostgreSQL compatibility cluster can now be restored from an unencrypted snapshot. To encrypt the new database, create a new key or choose an existing AWS Key Management Service (KMS) key.

Cloning Across AWS Accounts

An Amazon Aurora DB cluster can now be shared with other AWS accounts, which provides a quick and efficient way of cloning the database. Database cloning does not require additional space and is faster than restoring a snapshot.

Clearly, RDS for Amazon Aurora saw a lot of upgrades this year! Feel free to contact us if you want to get started with these services.


Amazon RDS (Relational Database Service) is one of the many web services offered by AWS. Amazon RDS makes the job of database administrators easier, helping them set up, operate, and scale databases and deployments. RDS is a great platform for DBAs who are looking to spend less time on traditional tasks like platform work, security policy provisioning, performance monitoring, and database maintenance. This frees up time to focus on developing an application.

RDS supports DB engines such as MySQL, MariaDB, PostgreSQL, SQL Server, and Oracle. If you have an Oracle database and are planning to move to the cloud, RDS will help you deploy within minutes. You won’t lose any of your capabilities or features, and you will gain flexible capacity at a low cost. You can also migrate any applications or tools that use Oracle databases. RDS is improving constantly, with the primary goal of making its platform easier for DBAs.

Without further ado, here are the new features added to RDS for Oracle in 2019.

Supports a new version of SQLTXPLAIN (SQLT)

SQLT is an Oracle tool that helps identify and troubleshoot the performance of SQL statements. Oracle SQLT delivers a report that includes statistics, the execution plan, optimizer performance, and other details that can help tune SQL queries. RDS for Oracle now supports version 12.2.180725 of SQLTXPLAIN.

Supports db.z1d and db.t3 DB instance classes

Z1d instances are the latest generation of instance classes optimized for memory-intensive applications. They offer both high compute capacity and a high memory footprint.

T3 instances are the next generation low cost burstable general-purpose instance type that provide a baseline level of CPU performance with the ability to burst CPU usage at any time for as long as required. T3 instances are designed for applications with moderate CPU usage that experience temporary spikes in use.

Oracle RDS now supports the Z1d and T3 instance families.

Supports Oracle APEX versions 19.1.v1, 18.1.v1, and 18.2.v1

APEX (Application Express) is a low-code, browser-based development platform that helps developers build applications that can be deployed anywhere. Versions 11.1, 12.1, and 12.2 of Oracle in RDS now support versions 19.1, 18.1, and 18.2 of Oracle APEX.

Supports Amazon S3 integration

Oracle RDS now supports integration with S3 for data ingress and egress. This integration supports easy, efficient, and secure data transfer between Oracle RDS and S3. Before this integration, data transfer from S3 to Oracle RDS required an additional DB instance and a networking link between the two instances.

Supports Read Replicas

Oracle RDS supports read replicas with Active Data Guard. Oracle RDS makes it easy to create replicas within the same region. The primary DB instance manages the Active Data Guard configuration and maintains secure communication with the replica instances.

Supports RMAN DBA tasks

Oracle RDS supports Oracle Recovery Manager (RMAN) DBA tasks. Using the Amazon RDS package rdsadmin.rdsadmin_rman_util, database backups can be performed to disk. The rdsadmin.rdsadmin_rman_util package supports full and incremental database file backups, tablespace backups, and archive log backups.

Support for 64 TiB of storage and 80k IOPS

The maximum storage for Oracle RDS has now been increased to 64 TiB. Oracle RDS also supports provisioned IOPS performance of up to 80,000 IOPS. Existing databases can be scaled up to 64 TiB without any downtime. With this increased limit, you can consolidate your database shards into a single database instance.

Amazon RDS Performance Insights Counters

Performance Insights is an advanced performance monitoring tool for database workloads. Additional Counter Metrics are now supported for AWS Oracle RDS. Counter Metrics allows customizing the Performance Insights dashboard with 10 additional graphs.

Amazon RDS Performance Insights Supports Viewing More SQL Text

Performance Insights for Oracle RDS now supports showing up to 1,024 bytes of SQL text for each row in the Top Load Items table. The earlier limit was 500 bytes.


AWS SageMaker is a machine learning platform for data scientists to build, train, and deploy predictive ML models. Unlike the traditional machine learning process, SageMaker allows data scientists to hop into the driver’s seat on projects and complete all three steps independently.

In an attempt to show you how to use SageMaker yourself, we decided to use the platform to create a machine learning model of our own. Using publicly available airline data, our goal was to create a model that would predict whether or not a flight would be delayed. For this blog post, we are going to walk you through from start to finish how we created our machine learning model using SageMaker. If you would like to follow along with us, click HERE to access an HTML copy of our notebook. Let’s get started!

The Set Up

To start creating resources, we first went to the AWS website and logged into the console. Before we use SageMaker, we need to ensure that we either have an S3 bucket ready to use or create a bucket that we can use with our instance. Whichever S3 bucket you decide to use, it MUST be in the same region as your notebook instance.

Once we have an S3 bucket ready to use, we can head over to SageMaker within the AWS console to set up our notebook. When you arrive at the SageMaker dashboard, click on the button that reads “Notebook Instances.” This will direct you to where you can either access a pre-existing notebook or make a new one. To make a new resource, follow this procedure:

  1. Select “Create notebook instance”
  2. Give your notebook a name
  3. Select the instance type
    1. You need to ensure that your instance is big enough to support your workload. Typically, we will use an ml.t2.medium instance, which has 4 GiB of memory, 2 vCPU, and low network performance. For our data set, we used an ml.m5.xlarge instance which has 16 GiB of memory, 4 vCPU, and high network performance.
  4. If you would like to add an Elastic Inference, select which type you would like to add here; note that this is not required
  5. Specify your IAM role
    1. Select “Create a new role”
    2. Select “Specific bucket” → type in the name of the specific S3 bucket you would like to call
  6. Select “Network”
    1. Select “VPC” and select the default option from the drop down menu
    2. Select a subnet
    3. Select the default security group from the drop down menu
  7. Select “Create notebook instance” at the bottom of the page

From here you will be directed to the Jupyter Notebook server:

  1. If you need to upload your data from your drive, select “Upload” in the top right corner, and then select the file you wish to upload into the server
  2. To create your notebook, select “New” in the top right hand corner, and then select your preferred notebook type from the drop down menu. SageMaker supports both Python and R, and numerous environments. For our example, we decided to use “conda_python3.” From here you will be directed into your notebook resource where you can begin creating your model.

Step One: Preparing the Data

Now that our notebook is created and ready to use, we can begin building our model. Before we get started, let us note that we have decided to use the “linear learner” algorithm, which will dictate how we approach certain steps. If you decide to use a different algorithm, check out the SageMaker documentation for what needs to be done. Here is how we went through the building step using SageMaker:

  1. Import all of the necessary libraries and modules; to see which ones we used, check out the HTML file that contains all of our code.
  2. Read your csv as a pandas data frame
  3. Clean the data
    1. All data cleaning is subjective and up to you
  4. Format the data according to the model’s requirements. In the case of the linear learner, we must do the following:
    1. The data must be transformed according to how it is being fed to the training algorithm
      1. This information can be found in the SageMaker documentation
      2. For the linear learner, we transformed the data into matrices
    2. All variables need to be integers
    3. In the case of a binary classifier, all Booleans must be expressed as 0 and 1. In the case of a multiclass classifier, the label must be an integer between 0 and the number of classes minus 1.
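A minimal sketch of this formatting step, using a small hypothetical flight-delay frame in place of our real CSV (the column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for our real CSV of flight records
df = pd.DataFrame({
    "distance": [500, 1200, 300, 2100],
    "dep_hour": [6, 18, 9, 22],
    "delayed": [False, True, False, True],  # binary label
})

# Linear learner expects numeric matrices; the booleans become 0/1,
# and everything is cast to float32 before being fed to the algorithm
features = df[["distance", "dep_hour"]].to_numpy(dtype=np.float32)
labels = df["delayed"].astype("float32").to_numpy()
```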

Step Two: Training the Model

  1. First we need to decide what algorithm we are going to use. For our model, we used linear learner. The remaining list of readily available algorithms is provided in the SageMaker documentation.
  2. Split up the data into training, testing, and validation data. We split the data in the following proportions:
    1. 80% used for training
    2. 10% used for testing
    3. 10% used for validation
    4. The splitting of the data is subjective and depends on the use case and other factors
  3. Create three separate channels for the training, testing, and validation data respectively
  4. Start a training instance using your algorithm of choice. This creates a container with the algorithm, which in our case was linear learner, implemented and ready to use.
  5. Run the training job
    1. This will take a bit of time; our training job took approximately 1600 seconds, which equates to about 27 minutes. The length of the training job largely depends on the size of your data and your algorithm.
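The 80/10/10 split in step 2 can be sketched in a few lines of NumPy; the random array here is just a stand-in for the prepared feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))   # stand-in for the prepared feature matrix

# Shuffle the row indices, then carve out 80% / 10% / 10% slices
idx = rng.permutation(len(data))
train_end = int(0.80 * len(data))   # first 80% for training
val_end = int(0.90 * len(data))     # next 10% for validation
train = data[idx[:train_end]]
validation = data[idx[train_end:val_end]]
test = data[idx[val_end:]]          # final 10% for testing
```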

Once the training job is complete, you are ready to deploy and evaluate your model to see how well it performs.

Step Three: Deploying and Evaluating the Model

  1. Using one line of code, deploy the model; this returns an endpoint that can then be used to make predictions.
  2. Once the model is deployed, create a function and call it to evaluate the accuracy of the model
    1. We found that our model runs with 76.72% accuracy, which is decent.
    2. We found this metric simply by taking the number of successful predictions out of the total number of predictions made using the test data. Therefore, we can say that our model accurately predicted if there was a flight delay 76.72% of the time.
  3. Make sure you terminate the endpoint when you are done using it so you are not billed for resources not being used.
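The accuracy metric in step 2 is simply successful predictions over total predictions; here is a sketch with hypothetical prediction arrays:

```python
import numpy as np

# Hypothetical binary predictions from the endpoint vs. actual test labels
predicted = np.array([1, 0, 1, 1, 0, 1, 0, 0])
actual    = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy = number of correct predictions / total predictions
accuracy = float(np.mean(predicted == actual))
print(f"{accuracy:.2%}")  # prints 75.00% for this toy example
```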

Final Thoughts and Best Practices

After using the application to build our model, here are some of our final thoughts:

  • SageMaker helped to streamline the process and saved an immense amount of time, particularly in the training and deployment stages
  • Make sure you select an instance type that best fits the needs of your data. You can run into issues, such as the kernel frequently crashing, if your instance type does not support the size of your data and the complexity of your chosen algorithm. For more information, consult the SageMaker documentation.
  • Use SageMaker documentation and supplementary resources to guide you through the process, as well as help provide explanations for AWS specific features you may not already be familiar with.
  • Unlike some other machine learning platforms, SageMaker does require ample experience and knowledge in Python. If you are a novice in this area, this may not be the best platform for you.
  • Always terminate resources you are finished using.
    • Stop running the notebook once you are finished working on it
    • Terminate the endpoint when you are done using it

Once again, if you would like to follow along with us or see how we went about this process, click HERE to look at an HTML file of our notebook. If you are interested in learning more about AWS SageMaker, feel free to reach out to our team – we would love to talk to you!


Welcome back to our fourth and final installment of our series! In this post, we will be discussing single and multiple linear regression, so let’s jump right into it.

Making Predictions Using Single Linear Regression

Linear regression is a statistical modeling tool that we can use to predict one variable using another. This is a particularly useful tool for predictive modeling and forecasting, providing excellent insight on present data and predicting data in the future. The goal of linear regression is to create a line of best fit that can predict the dependent variable with an independent variable while minimizing the squared error. That was a pretty technical explanation, so let’s simplify. We are trying to find a line that we can use to predict one variable by using another, all while minimizing error in this prediction. In order for this line to be helpful, we need to find the equation of the line, which is as follows:

ŷ = b₁x + b₀

    • ŷ → the predicted dependent variable
    • b₁ → the slope of the line
    • x → the independent variable, aka the variable we are using to predict y
    • b₀ → the intercept of the line

This equation may look familiar, and it should. It is the same equation as your standard y= mx + b that you learned back in Algebra I, just written a little differently in statistics language. 

Some of these concepts are difficult to understand on their own, so let’s apply them to our example housing data set to see them in action.

When we calculated correlation, we found a pretty strong correlation between the size of the living room and the price. Now that we know that there is something going on between those variables, we want to see if the living room size can be used to predict the price of our house. Before we jump into calculations, we first want to make sure that we can actually use the linear regression methodology; this means we need to check to see if the relationship between our living room size and price is approximately linear. 

On the left we have all of our data points scattered, and on the right we have the same graph except the line of best fit that we are trying to find is included as well:

When we scatter our data, we find that the relationship between the living room size and the price is approximately linear, meaning we now have the thumbs up to begin our linear regression. 

Testing single linear regression can be done by hand, but it is much easier and quicker to use tools like Excel or Jupyter Notebook to create predictive models. In order to conduct a regression analysis in Excel, you need to make sure that your Analysis ToolPak is activated. How to activate the ToolPak will depend on what version of Excel you are running, so I would recommend just doing a quick Google search on how to set it up. It is fairly simple, so this shouldn’t take any longer than a minute or two. Once it is activated, you are good to go!

Here is a brief outline of how to conduct your regression analysis using Excel: 

  1. Select “Data” tab → Select “Data Analysis” → Select “Regression”
  2. Input Y Range → select the data for the variable you are trying to predict; dependent variable 
    1. With our example: Price
  3. Input X Range → select the data for the variable you are using to predict the other variable; independent variable
    1. With our example: Living room size 
  4. Select the “Labels” box so that your output sheet will include the corresponding data labels
  5. If you want your regression output to be inserted onto a new worksheet page, check the box for “New Worksheet Ply”
  6. Check the boxes that say “Residuals” and “Residual Plots;” this will be important later on in our analysis and conclusions

When you press enter, you will be directed to a new workbook sheet that will have a bunch of output tables that may seem overwhelming and confusing, but don’t worry! We are going to show you which numbers are actually important for you to look at. 

We have highlighted all of the boxes / numbers that are important to look at. Let’s walk through these individually.

  1. R Square → the r square value tells us how good our model is at predicting the dependent variable. The closer the number is to 1, the better our model is. With this output, we see our r square value is 0.4949, which means that 49.49% of the variation in our data can be explained by our model.
  2. Intercept Coefficient → the intercept coefficient value tells us our b₀ value, which is the intercept of the regression line; this tells us that when the square footage of the living room is zero, the price of the house is -$43580.74
    1. NOTE: The intercept is not always applicable, which is true in our instance since a negative house price is impossible
  3. Sqft_living Coefficient → this gives us the slope of our line, which is b₁; the slope tells us that as the square footage of the living room increases by one square foot, the price of the home increases by $280.62.
  4. Sqft_living p-value → this will tell us if the predictor in question, which in our case is the living room size, is a good predictor of our dependent variable. In this instance, with a p-value of approximately zero, we can conclude that the living room size is a good predictor of housing price. 
    1. Note: You can also look at the Significance F value, but if your predictor is statistically significant, the model will be too and vice versa
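For intuition, the r square value can be reproduced from the residuals themselves; here is a sketch with made-up actual and predicted values:

```python
import numpy as np

# Made-up actual and predicted values for illustration
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([10.5, 11.5, 14.5, 15.5])

ss_res = np.sum((actual - predicted) ** 2)       # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation
r_squared = 1 - ss_res / ss_tot                  # 0.95 for this toy data
```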

Now let’s put it all together to create the equation of our line:

Predicted Price = 280.62(living room size) – 43580.74

Now, ideally, given a living room size, we can make a prediction of the price of any given house. 
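Written as a function, using the coefficients from the fitted equation above:

```python
def predicted_price(sqft_living: float) -> float:
    """Single-variable model: Predicted Price = 280.62 * sqft_living - 43580.74"""
    return 280.62 * sqft_living - 43580.74

# e.g. a 2,000 sqft living room predicts a price of about $517,659
```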

Now that we have built our model, it is important to then look at our residuals, otherwise known as errors. The residual / error, typically denoted as “e,” is essentially how far off our prediction was from the actual price. This is given by the following equation: 

e = actual – predicted

If you want to see this written in statistics language, keep reading; if not, go ahead and jump to the next section. 

e = y – ŷ 

Visually, if the data were to be scattered and the line of best fit laid on top, then the error values would be the distance from the actual observed value to the line at that specific x value, which would be the predicted y value. Here is a visual to help conceptualize this: 

When our error is negative, that means that we overestimated the price; when our error is positive, that means we underestimated the price. When our error is zero, that means our predicted price was accurate with the actual housing price given. The smaller the error, the more accurate our prediction is. 
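A quick sketch of the residual calculation, using hypothetical actual prices and the fitted equation from earlier (Price = 280.62 × living room size − 43580.74):

```python
import numpy as np

# Hypothetical living room sizes and actual sale prices
sqft_living = np.array([1180.0, 2570.0, 770.0])
actual = np.array([221900.0, 538000.0, 180000.0])

predicted = 280.62 * sqft_living - 43580.74
residuals = actual - predicted   # e = actual - predicted

# Negative residual -> we overestimated the price;
# positive residual -> we underestimated it
```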

When looking at error, we want to ensure that we are equally overestimating and underestimating our data, and that errors are occurring randomly. This is why, when we are constructing our data analysis, we want to include residuals and residual plots. How do we make sure these assumptions are being upheld? Excel will calculate all of our errors and then scatter them on a graph with the x axis being the independent variable and the y axis being the error value. We will look for the following characteristics on the graph to ensure these assumptions are upheld:

  1. The data points are scattered evenly around y = 0
    1. Equally over and underestimating the dependent variable
  2. The residual plot follows no pattern
    1. Errors are occurring randomly 

Here is an example of a residual plot that upholds these assumptions relatively well:

Here is the residual plot for our single linear regression analysis with living room size: 

Looking at the plot, we can see the residuals are approximately scattered around zero, meaning we are equally over and underestimating the price. However, we do notice a pattern, meaning the errors are not necessarily occurring randomly. Looking at the graph, we notice a horn-shaped pattern, meaning that as the living room size increases, so does the error. This indicates a heteroscedasticity problem, which is a much simpler concept than the word is to pronounce. All this means is that the residuals are occurring unevenly, resulting in the pattern we see above. In order to uphold the regression assumptions, we would want to see homoscedasticity, meaning that the residuals are occurring evenly throughout. This breach of the assumption is very important to note, because it means that the living room size may not be a good candidate for predicting house price. This means we should look more into other predictors and find a better candidate. Let’s take a look at the basement size.

Here we see that the p-value for basement size as a predictor and the F value for the model are both zero, meaning that the model is a good predictor of price. Moving to the r square value, we find that it is less than that of the living room model; this means that the model with the basement size explains much less of the data than that of the living room size. However, the issue that we came across with living room size lay in the residual plot, so let’s take a look at that.

We first notice that the residuals are evenly scattered around zero. Then we notice that, unlike the living room size residual, there is no apparent pattern in the plot. This means that there is homoscedasticity and therefore is a better candidate for predicting price than the living room size is. 

Once again, this process is easily repeatable due to tools such as Excel. We have gone ahead and tested the remaining viable predictors and created a comprehensive table of our findings: 

Predictor                                    Significant
Living Room Size (in sqft)                   Yes
Living Room Size in 2015                     Yes
Basement (in sqft)                           Yes
Lot Size (in sqft)                           Yes
Lot Size in 2015                             Yes
Square footage of the house above ground     Yes

Of the single predictor models, we can see that the basement model and the living room size in 2015 model uphold the assumptions the best out of all of the other predictors. However, it should be noted that both r square values were modest, so we need to see if there is a better model out there. We know that one variable likely is not able to be the sole predictor of housing price, so what if we combine multiple predictors into one model? That is where multiple regression comes in.

More Predictive Modeling with Multiple Linear Regression

We are going to build off of single linear regression, so if you are still confused about that, it may be beneficial to go back through and review the previous section until you feel comfortable with the concepts. If you are good to go, then let’s venture on to multiple linear regression!

Multiple linear regression is very similar to single linear regression except it is a little bit more complex. Instead of looking at just one predictor, we are now going to incorporate multiple predictors into our model to see how good they are at predicting price together. So with our example, instead of looking at just one predictor of housing price at a time, we can create one model that incorporates all of our different predictors to see if we can better approximate the price.

One of the assumptions for multiple linear regression, just like single linear regression, is that the multivariate regression line is approximately linear. However, unlike our single linear regression line, it is near impossible to confirm whether or not we uphold this assumption since our line cannot be conceptualized on a 2-D plane. This does not pose a serious obstacle since deviation from the assumption does not impact our results in a major way. However, it is a good rule of thumb to look at each variable’s linearity individually just to ensure that there is nothing that poses a serious threat to the assumption. When looking at the variables we will be exploring, all of them follow an approximately linear relationship and therefore should uphold our linearity assumption for the multivariate model.

Once again we will be using the Analysis ToolPak in Excel, so make sure that is good to go. Head to the data tab and select data analysis and regression, just like we did with the single linear regression. The only thing that we are going to change here is our “X Input Range” values. Instead of inputting just one predictor, we are going to input all of our different predictors at the same time; so in this instance: the living room size, the lot size, and the square footage of the house above ground level. Make sure you select the “Labels,” “Residuals,” and “Residual Plots” boxes as well. Press enter, and you are good to go.

Since multiple linear regression is more complex than single linear regression, it is fitting that there are a few more numbers that we will have to look at when doing our analysis. We went ahead and highlighted them for you: 

Once again, let’s go through each of these numbers individually:

  1. Adjusted R Square → As you increase the number of predictors in your model, your r square value will artificially inflate. That is why there is an adjusted r square value that accounts for the increase in the number of predictors to give a deflated and more accurate figure.
  2. Regression Significance F → This is like a p-value but for our entire model; it will tell us whether or not our model is a good fit for predicting our dependent variable, which in this case is price.
  3. Intercept Coefficient → Just like in single linear regression, this will tell us the intercept for our model. 
  4. Predictor Coefficients → The remaining coefficients for our different predictors will give us the slopes for each independent variable respectively.
  5. P-value → This tells us whether or not the variable is good at predicting price given that the other variables are already included in the model. 
    1. Ex: Living room size is a good predictor of price when lot size and the square footage of the house above ground are already included in the model 
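The adjusted r square in point 1 follows a standard formula; here is a sketch with made-up values (n is the number of observations, k the number of predictors):

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Deflate r square to account for the number of predictors k,
    given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With made-up values: raw r square 0.50, 100 observations, 3 predictors
adj = adjusted_r_squared(0.50, 100, 3)   # slightly below the raw 0.50
```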

Now we can put these together to create our equation: 

Price = 269.15(living room size) – 0.2782(lot size) – 16.90(sqft above ground) – 41445.124
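As a function, with the coefficients copied from the equation above:

```python
def predicted_price_multi(sqft_living: float, sqft_lot: float,
                          sqft_above: float) -> float:
    """Multiple regression model, coefficients as reported in the output."""
    return (269.15 * sqft_living
            - 0.2782 * sqft_lot
            - 16.90 * sqft_above
            - 41445.124)
```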

Now looking at “Significance F,” we see that it is approximately zero. What does this mean? For consistency, let’s keep our significance level, or cut off value, at 0.05. Since our F value is less than our cut off, just like a p-value, we can say that our model is a good fit for our data and for predicting our dependent variable. This is a great tool for comparing different models. For example, let’s say we created another model that had an F value greater than 0.05, then we know it is not a good fit for the data, and therefore we should go with our other model since it statistically proves to be a better fit. 

But what if there are multiple models that prove to be significant? How do you pick which one is best? This is where it gets to be a bit more subjective and flexible. To do this, you will have to examine and compare your adjusted r square values. The higher your adjusted r square value is, the better your model is at explaining the data and therefore is a better model for predicting our dependent variable. However, if we increase our predictors and we only see a slight increase in the r square value, it technically is a “better model,” but we can conclude that the additional predictors are not really contributing to the model in a significant way since they don’t help explain that much more of the data. 

So let’s say that you drop all variables except for one and you find that your r square value is similar to that of the model that includes all of the possible variables, which will be referred to from here on out as the “extended model.” What does this result mean? This means that with one predictor, you are able to explain approximately the same amount of data as you can with the extended model. And what does that mean? That means that your other variables are mostly just deadweight, not really contributing anything meaningful to your model. Your model with just one variable is able to do just as good of a job as the model with multiple variables. You can then make the model simpler and more concise by dropping variables that don’t significantly impact the model. This is what makes F testing flexible; you are able to play around with a multitude of different models and see for yourself which variables are significantly impacting your dependent variable and which aren’t.

We played around with some different models and put them into a comprehensive table for you below:

Model                                              Significance    Adjusted R Square
Living room size, lot size, sqft of above ground   Significant     0.4942
Living room size and lot size                      Significant     0.4938
Living room size and sqft of above ground          Significant     0.4932
Lot size and sqft of above ground                  Significant     0.3671
Living room size                                   Significant     0.4928
Lot size                                           Significant     0.00799
Sqft of above ground                               Significant     0.3667


This is a lot of data to digest, so let’s break it down so it is a bit easier to comprehend. We don’t need to bother looking at the significance since all of the tested models proved to be statistically significant. Does that mean that they are all good models? Technically yes, but we can pick the best model by looking at our adjusted r square number. We see that the highest r square value belongs to the extended model, the one that includes all of our variables. Does that mean the extended model is the best model? Once again, technically yes, but let’s dig a little deeper. We see that our model that only includes the living room size has an r square value that is within hundredths of the extended model. This means that the model with only living room size can explain approximately the same amount of data as the extended model. What can we do with this information? We can conclude that the lot size and square footage of the house above ground level are not contributing much, since they are not really helping to explain more of the data, so we can drop them from our model and get almost identical results. It can’t hurt per se to use our extended model, but using just the one variable makes our model more concise and easier to work with, leaving less room for errors and complications.

How do we know when adding a predictor to our model is useful? Let’s compare our model that includes only the lot size to our model that includes both lot size and living room size. When we add the living room size to the lot size model, our new model is now able to explain 48.581% more of the data. That’s a lot! This shows us that adding the living room size to our lot size model is extremely valuable since our r square value, and thus our ability to explain the data with our model, has increased drastically. We also see our model explain 12.71% more of the data when we add living room size to the lot size and square footage of the house above ground model, which is modest but nonetheless impactful and noteworthy.

Some Parting Thoughts and Comments

Thanks for sticking with us through the series! We hope that you have taken something away from these posts that can help you to better look at and analyze your data. We also hope that we have demonstrated to you that you don’t need to be a data scientist or a statistics expert to turn your data into insight. 

For this blog series we ran quite a few numbers and tests in both Excel and Jupyter Notebook, so once again, if you are interested in seeing exactly what went into the calculations and graph / figure creation, follow the links below to be directed to all of the work we did with our data. Within the Excel worksheet you will find lots of data and lots of tests. The first worksheet includes all of the data as well as some relevant information about the set and decoder keys. All subsequent pages are various tests that were conducted and reported throughout this paper. Within the Jupyter Notebook file you will also see the data uploaded from a csv file into the worksheet. Python was then used to create the sample size demonstration and all of the graphs / figures you see throughout. In addition, it demonstrates some alternative methods for calculating correlations, line parameters, and errors that we used to check our numbers against the Excel calculations. The functions that we have defined make calculations and testing highly repeatable; we would recommend using them with your own data set if you are interested in using Jupyter Notebook instead. Do note, however, that the results in both Excel and Jupyter Notebook are identical, and neither is more accurate or beneficial than the other. This is simply a way to show how you can perform these analyses on either platform depending on your skillset and knowledge.

Link HERE for Excel workbook 

Link HERE for Jupyter Notebook file 

Part 1 of Series: Click here

Part 2 of Series: Click here

Part 3 of Series: Click here

Microsoft Excel Guide

  • =AVERAGE(array)
    • Takes the average of the array
  • =STDEV.S(array)
    • Standard deviation of the sample  
  • =CORREL(array1, array2)
    • Finds the correlation between data array 1 and data array 2 
  • =NORM.S.DIST(z*, 1)
    • Finds the area to the LEFT of the z* value
    • Use the normal distribution (“Z”) functions when you are working with a population
  • =NORM.S.INV(probability)
    • Finds the z* value for the corresponding probability value
  • =T.DIST(t*, df, 1)
    • Finds the area to the LEFT of the t* value 
    • Use “T” when you are working with a sample 
  • =T.INV(probability, df)
    • Finds the t* value for corresponding probability value 

A Note About the T.INV and NORM.S.INV Functions 

We used these functions to find t* and z* multipliers for confidence levels / intervals. When we conducted a confidence interval at 95% confidence, we did not input 0.95 into the T.INV or NORM.S.INV function, but rather 0.975. Here is an explanation of how to find what value to input into these functions: 

The figure below exhibits the task at hand when finding a t* or z* value at a certain confidence level. The blue area represents our confidence level, which is also a probability value. What we want to do is find the t* or z* value that corresponds with this probability value, as represented by the red circle. 

In this example, we are finding a 95% confidence level, and therefore are looking at a probability value of 0.95. When using the T.INV or NORM.S.INV function, we need to use ALL of the area to the left of the t* or z* value that we are trying to find; that means that we need to find the remaining area under the curve.

We know that the area under the curve must total 1, so the remaining area must equal 0.05 (1 − 0.95). We also know that since we are working with a normal distribution, the curve is perfectly symmetrical. This means that the remaining areas in the two tails of the curve are equal. If the remaining two sections must total 0.05 and there are two tails, each tail’s area must be 0.025 (0.05 / 2). Now we add the left tail’s area to the 0.95 to find the cumulative area to the left of our multiplier, leaving us with 0.975 (0.95 + 0.025). The 0.975 value is what we input into our Excel function to find the multiplier value.

As you conduct these tests more frequently, you will naturally memorize what value to input into your functions at different confidence levels. However, it may be difficult at first, so we recommend drawing out the curve, the area, and the multiplier you are finding to help visualize exactly what is being calculated and needs to be inputted. 
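If you are working in Python instead, the same 0.975 logic applies; here is a sketch using the standard library’s NormalDist, the z analogue of the Excel inverse function:

```python
from statistics import NormalDist

confidence = 0.95
# Area to the LEFT of z*: the confidence level plus half the remaining tail area
p = confidence + (1 - confidence) / 2   # 0.975
z_star = NormalDist().inv_cdf(p)        # roughly 1.96
```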



In our introductory blog piece, we discussed some high level statistical concepts which you can read HERE if you haven’t already or need a quick refresher. Now, we can finally start to dive more into the data and start running some tests. 

Correlation: A First Clue for a Relationship

A good place to start data analysis is by looking at correlation. Correlation, typically denoted as r in statistics language, is the relationship between the independent variable, x, and the dependent variable, y, and lies in between -1 and 1. Put even more simply, correlation tells us if there is a relationship occurring between two variables.

This is an important statistical concept, because it sets the stage for further investigation with our data. Finding the correlation between two values is our first clue that a relationship either exists, or does not exist, between the two variables. Using our knowledge about the strength of the relationship, we can then proceed to find out how good one is at predicting the other using linear regression methods. Essentially, correlation is the groundwork that allows us to determine which relationships are worth further investigation. Our correlation value can provide valuable insight on the strength and the direction of the relationship. 

Below is a comprehensive table that you can use to decode your correlation value: 

    • Positive correlation value → positive direction / association; as one variable increases, so does the other
    • Negative correlation value → negative direction / association; as one variable increases, the other decreases
    • |r| closer to 1 → the closer |r| is to one, the stronger the association
    • |r| closer to 0 → the closer |r| is to zero, the weaker the association

Finding our correlation value, and many other statistical tools, is made easy in Excel. The function you will use to find r is =CORREL(array1, array2). Array 1 will simply be all of the data for your first variable, and array 2 is all of the data for your second variable; the order does not matter. For example, if I were finding the correlation between GPA and SAT scores, I would first select all of my GPA data points, separate the first array from the second with a comma, and then select all of my SAT data points. Then you press enter and voila! You have your correlation value. 
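The same calculation works in Python; here is a sketch of the GPA/SAT example with made-up numbers, using pandas (whose default .corr() computes the same Pearson correlation as Excel’s CORREL):

```python
import pandas as pd

gpa = pd.Series([3.0, 3.4, 3.8, 2.6])       # made-up GPA data points
sat = pd.Series([1200, 1320, 1450, 1050])   # made-up SAT data points

# Pearson correlation; the order of the two series does not matter
r = gpa.corr(sat)
```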

Going back to our example data set, we have created a correlation table that shows the correlation between all of our different variables. A heat map has been included on top to help visually show which have strong or weak associations and which have positive or negative associations. The more saturated the coloring of the box, the stronger the association. If the box is more red, the association is positive, and if the box is more blue, the association is negative. For example, a dark red box would indicate a strong, positive correlation, or a lighter blue would indicate a weak, negative correlation. 

Perhaps the most important thing to take away from this section is that correlation does NOT imply causation. Two variables may be correlated, but that does not mean that one causes the other; we cannot emphasize this enough. There may be other confounding variables, which are variables that are not accounted for, that may be the cause of the phenomena that we see. For example, let’s say you find there is a correlation between water consumption and the frequency of illness. It would be absolutely ludicrous to draw the conclusion that drinking water causes you to be ill! Instead, we would use this finding of correlation to indicate that there is likely something else going on, such as the presence of cholera. While there is a correlation between the water consumption and illness, we know that the consumption of water isn’t truly what is causing the illnesses; it is the infectious bacteria contaminating the water that cause people to become ill. This is why it is important to discern the difference between correlation and causation so you do not inaccurately report findings. Statistics lays the groundwork for your analysis, but proper research is necessary to really come to the true root of the phenomena.

AB Hypothesis Testing: A Statistical Experiment

Think of the science experiments that you used to do back in school. You pose a question that you are trying to answer; for example, how does sunlight affect plant growth? Then, you make an educated guess, or a hypothesis, as to what you think the results would be, and test whether or not your guess is correct. Statistical hypothesis testing is exactly the same; essentially, you think that there is some kind of relationship occurring between two variables, so you run a statistical experiment to see if your guess, or hypothesis, is correct. 

This process begins by looking at the correlation value, which will indicate whether or not there are variables worth investigating. For example, there was a high correlation between the grade, the size of the living room, and the square footage of the house above ground. Since the square footage of the house above ground and the size of the living room can take on an infinite number of values, that makes them continuous variables; this means that it would be advantageous to use linear regression to look deeper into their relationship with price, and even investigate if they can be used to predict it. However, the housing grade is a discrete variable, since there are a finite number of values that this variable can take. You cannot receive a housing grade of 2.5, only 2 or 3. Since grade is a discrete variable, it cannot be used in a linear regression model and would be tested best with the AB hypothesis testing method. 

From here, we are going to make a claim that we are trying to prove with the data; this is going to be the alternative hypothesis. However, a null hypothesis also needs to be set, which is a statement that essentially says that our inquiry was wrong, and the data does not support the claim made by the alternative hypothesis. However, the claim cannot be verified by just looking at the raw data; we will need a metric. The metric that is used to numerically show statistical significance is the p-value. If you are a statistics junkie, read our “For Statistics Junkies” section to learn more about the concepts behind the p-value and how it shows us significance; if not, you can go ahead and skip it.

For Statistics Junkies: The “p” in p-value stands for probability, which is what we are actually trying to find during a hypothesis test. What is that probability exactly? It is the probability of seeing the value we observe in our data, given that the null hypothesis is true. If the p-value is small, the probability of seeing that observation is low if the null hypothesis were actually true. If the p-value is large, the probability of seeing that observation is high if the null hypothesis were actually true. When we say we are doing a test under the assumptions of the null hypothesis, all we are saying is that the test is conducted under the assumption that the null hypothesis is true, which really only affects how we interpret our p-value.

If it is unlikely that we see an observation assuming the null hypothesis to be true, we can conclude that the null hypothesis is false, and there is sufficient evidence in the data to reject it. However, if it is likely that we do see an observation assuming the null is true, we would fail to reject the null hypothesis. Notice how we said fail to reject and not simply accept it. When we fail to reject the null hypothesis, we aren’t saying the null hypothesis is true; we are only saying that there isn’t enough evidence in the data to support the alternative hypothesis claim. 

Recall that finding a probability means finding an area under the distribution curve. The figure below provides a visual of the area we are trying to find for our example hypothesis test with housing price.

To conduct the hypothesis testing on housing grade, the data was first sectioned off so that group one contained houses with grades 7 through 13, and group two contained houses with grades 1 through 6. Note that on the grading scale, a grade of 1 is the lowest and a grade of 13 is the highest. Before conducting any portion of the hypothesis test, it is important to first establish a null and alternative hypothesis:

  • Null Hypothesis: There is no statistically significant difference between the price of homes with a grade between 7 and 13 and the price of homes with a grade between 1 and 6. 
  • Alternative Hypothesis: The price of homes with a grade between 7 and 13 is statistically significantly greater than the price of homes with a grade between 1 and 6.

Statistically, this is how to express the null and alternative hypotheses, where μ₁ and μ₂ are the population mean prices of the grade 7-13 and grade 1-6 groups respectively:

  • Null Hypothesis: H₀: μ₁ − μ₂ = 0
  • Alternative Hypothesis: Hₐ: μ₁ − μ₂ > 0

When reading statistical notation, the letters used to represent variables reveal a lot of information. When a Greek letter is used, it represents a population parameter. When a Latin letter is used, which is just the alphabet used for modern English, it represents a statistic from the sample. For example, “σ” represents the population standard deviation whereas “s” represents the standard deviation of the sample.

Here is a brief overview of the process of hypothesis testing. Following this portion, this process is applied to the housing data to investigate if houses graded 7 through 13 are priced higher than houses graded 1 through 6. 

  1. Find the test statistic with the following formula:

      t* = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

      1. x̄₁ = the mean price of houses with grades 7-13
      2. x̄₂ = the mean price of houses with grades 1-6
      3. s₁ = the standard deviation of the price of houses with grades 7-13
      4. s₂ = the standard deviation of the price of the houses with grades 1-6
      5. n₁, n₂ = the sample size for each group respectively
      6. NOTE: In Excel, =STDEV.S must be used when finding s₁ and s₂ since the standard deviation is being calculated from a sample, not a population
  2. Calculate p-value with the following formula
    1. p-value = 1 - T.DIST(t*, df, TRUE)
      1. This T.DIST function takes a test statistic and finds the corresponding area under the curve, or in other words the p-value
      2. df = degrees of freedom = n - 2, where n is the sample size for the entire data set
      3. NOTE: You must do 1 - T.DIST(t*, df, TRUE) because Excel's T.DIST returns the area to the LEFT. Since the hypothesis test is one sided in the right direction, you need to take the area to the left and subtract it from 1 to leave the remaining area to the right. Note that if the hypothesis test were two sided, you would multiply the tail area, 1 - T.DIST(t*, df, TRUE), by 2, since you need to find two areas under the curve.
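The two steps above can be sketched in code. This is an illustrative translation of the procedure, not the authors' spreadsheet; with degrees of freedom in the tens of thousands the t distribution is essentially the standard normal, so the normal right tail stands in here for Excel's 1 - T.DIST(t*, df, TRUE):

```python
import math
from statistics import NormalDist

def two_sample_t_stat(mean1, mean2, s1, s2, n1, n2):
    """Step 1: difference in sample means divided by the standard error."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / se

def right_tail_p(t_star):
    """Step 2 (approximation): one-sided p-value as the area to the RIGHT.
    For very large df the t curve is close to the standard normal curve,
    so we use the normal CDF instead of Excel's T.DIST."""
    return 1 - NormalDist().cdf(t_star)
```

Plugging in the worked example below (t* ≈ 1.1631), the right-tail area comes out to roughly 0.12, matching the Excel result.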

Here is the process applied to the example data set:

  1. Test Statistic
    1. t* = (541096.73 – 531672.31) / 8102.85
    2. t* = 1.1631
  2. P-Value 
    1. df = 21613 – 2 = 21611
    2. p-value = 1-T.DIST(1.1631, 21611, 1)
    3. p-value = 0.12

When following this procedure, the p-value works out to be 0.12. Since this is greater than the standard cut off of 0.05, we would fail to reject the null hypothesis. This does not mean that the alternative hypothesis is to be rejected; it means that there is simply not enough evidence to support the claim that houses with grades 7-13 are priced higher than houses with grades 1-6.

However, it should be noted that this does not express the magnitude of the difference, or lack of difference, in the prices. This is where practical significance comes in. In order to see the actual magnitude of the difference in prices, a confidence interval will be calculated. Essentially, this shows us how different these prices actually are, and if they are subjectively different enough to be noteworthy.

Explanation of Confidence Intervals

Confidence intervals are ranges in which the true population parameter is likely to lie. These complement hypothesis tests perfectly; the hypothesis test will indicate whether or not there is statistical significance, and then the confidence interval will give the magnitude of this significance. For example, if there is a statistical significance between the price of pencils sold at Walmart versus Target, calculating a confidence interval gives a range between which the true difference in prices lies. If it is found that the difference in prices is somewhere between $0.10 and $0.25, it can be said that the findings are not practically significant. However, if it is found that the difference in prices is somewhere between $2.00 and $3.00, it can be argued that this demonstrates practical significance and thus relevance for intended audiences such as consumers.

Confidence intervals are constructed at various levels of confidence, indicating the probability that the true population parameter lies within the range. To be more technical, the confidence level says that we have x% confidence that the true population parameter lies within this range. The standard confidence level is 95%, which will be used for this example. When the confidence level is increased, the confidence interval will become wider. When the confidence level is decreased, the confidence interval will become narrower. This will be demonstrated later with the example data with reasoning as to why this logically makes sense.

Here is a visual representation of the task at hand where the red circle represents the multiplier value to first be found:

Here is a brief outline of how to find the confidence interval:

  1. Find the multiplier, t*
    1. This is NOT the same t* that was used during hypothesis testing; it is a different multiplier that is found separately
    2. The opposite of the T.DIST function is used: T.INV(probability, df). This takes a probability (an area) and the degrees of freedom, and calculates the corresponding t* value. Referring back up to the figure above, since the t* value to be found includes ALL of the area to the left, the remaining area under the curve needs to be calculated and then added to 0.95. The total area under the curve must add up to 1, so the two remaining tail sections together must equal 0.05; due to the symmetrical nature of the bell curve, the two tail areas are equal, meaning they are each 0.025. The 0.025 is added to 0.95 to get a total area of 0.975; this is the area used in the function to find t*.
  2. Find the margin of error
    1. Margin of error = (t*) * standard error
      1. Where standard error = √(s₁²/n₁ + s₂²/n₂)
  3. Add and subtract the margin of error to the difference in sample means
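The three steps above can be sketched as follows. This is an illustration under the same large-df normal approximation used earlier, with NormalDist().inv_cdf standing in for Excel's T.INV; the function name is ours:

```python
import math
from statistics import NormalDist

def confidence_interval(mean1, mean2, s1, s2, n1, n2, confidence=0.95):
    """Confidence interval for the difference between two sample means."""
    # Step 1: the multiplier; e.g. 0.95 -> inv_cdf(0.975), roughly 1.96
    tail = (1 - confidence) / 2
    multiplier = NormalDist().inv_cdf(1 - tail)
    # Step 2: margin of error = multiplier * standard error
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    margin = multiplier * se
    # Step 3: add and subtract the margin from the difference in means
    diff = mean1 - mean2
    return diff - margin, diff + margin
```

Passing confidence=0.90 or confidence=0.99 reproduces the narrowing and widening behavior discussed later in this section.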

Here is the process worked out with the example housing data:

  1. T-Multiplier
    1. t* = T.INV(0.975, 21611)
    2. t* = 1.960074
  2. Margin of Error
    1. Standard error = √(s₁²/n₁ + s₂²/n₂) = 8102.85
    2. Margin of error = 1.960074 * 8102.85= 15882.17
    3. Lower Bound = (541096.73 – 531672.31) – 15882.17 =  -$6,457.75
    4. Upper Bound =(541096.73 – 531672.31) + 15882.17 = $25,306.59
    5. Confidence Interval = (-$6,457.75, $25,306.59)

When going through this process with the data set, it indicates that the difference between the two groups of homes is somewhere between -$6,457.75 and $25,306.59, where a negative number indicates that group two is actually priced higher than group one. Practical significance is much more subjective than statistical significance, so this price difference may not matter to some people, but it may to others. It should also be noted that the results of the confidence interval actually bolster the results found in the hypothesis test. In the hypothesis test, it was concluded that there is not enough evidence to support the claim that houses with grades 7-13 are priced higher than houses with grades 1-6. Looking at the CI, notice that the interval contains zero, meaning that it is plausible the two groups of homes are in fact priced the same. It also includes negative numbers, indicating that it is possible that the houses with grades 1-6 are even priced higher than houses with grades 7-13. This bolsters the findings from the hypothesis test, indicating once again that we should fail to reject the null hypothesis. However, if the CI had not contained zero, the data would suggest that the two prices are not the same, and the null hypothesis could be rejected. This is another way that confidence intervals are used to complement and bolster the findings in hypothesis testing.

Following the same procedures outlined above, the CI was recalculated by decreasing the confidence level to 90% and then increasing it to 99%:


When the confidence level was increased, the CI widened. Intuitively, this makes sense: we can be more confident that the population parameter lies somewhere within the range because the range contains so many more possible values. The root of this is that a larger margin of error is being built in, accounting for more possible error.

To conclude, it has been identified that there is not a statistically significant difference between the price of homes with grades 7-13 and the price of homes with grades 1-6. Therefore, we fail to reject the null hypothesis, as there is insufficient evidence from the data to support the claim that higher graded houses are priced higher. We have 95% confidence that the difference in price between the two housing groups is between -$6,457.75 and $25,306.59.

This process is easily repeatable, especially with tools like Excel, and therefore can be used to investigate numerous variables and characteristics in the data. Here is a comprehensive table of the results from the hypothesis testing with the housing data set: 

  Variable            Result                          95% Confidence Interval
  Waterfront          Statistically significant       (958246.81, 1302378.04)
  Renovation*         Statistically significant       (190329.86, 269706.56)
  Renovation Year**   Statistically significant       (35718.74, 192890.85)
  Condition***        Not statistically significant   (-5383.70, 15105.12)
  Year Built****      Not statistically significant   (-21513.88, 2102.56)


* = We looked at whether houses that have been renovated were priced higher than houses that have not been renovated

** = We looked at whether or not houses renovated in or after 2000 were priced higher than those renovated before 2000

*** = We looked at whether houses with condition grades 4 or 5 were priced higher than those with grades 1, 2, or 3.

**** = We looked at whether houses built in or after 2000 were priced higher than those built before 2000.


Thanks for tuning into our discussion on hypothesis testing! We hope you rejoin us next time when we dive into linear regression modeling.


Missed the first post in this series? Click here.

Missed the second post in this series? Click here.

Ready for the next post? Click here.



Welcome back! We’re glad you made it to Part 2 of this series. Let’s dive right in. 

One of the most important statistical concepts is the distribution curve. A distribution curve is a graph that shows the different frequencies of a certain variable. For example, if I were to make a distribution curve of scores received on a test, the graph would show how many people received each score. The area under the distribution curve is the probability that you see a particular instance occur. Going back to our previous example, if I wanted to find the probability of a student receiving a score between 80 and 90, I would find the area under the curve in-between those two test scores. The total area under the curve is equal to 1. These concepts will be important later when discussing more about p-value in hypothesis testing. 
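As a concrete sketch of this idea (the mean of 75 and standard deviation of 10 below are made-up numbers for illustration, not drawn from any real class):

```python
from statistics import NormalDist

# Hypothetical distribution of test scores: mean 75, standard deviation 10
scores = NormalDist(mu=75, sigma=10)

# P(80 < score < 90) is the area under the curve between those two values
prob = scores.cdf(90) - scores.cdf(80)
print(round(prob, 4))  # roughly 0.24: about a 24% chance

# The area under the entire curve is 1, so probabilities of
# non-overlapping score ranges always sum to at most 1.
```

The same area-under-the-curve idea is exactly what the p-value computations later in the series rely on.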

In later posts we will frequently be referring to a specific type of distribution called a normal distribution, often referred to as a bell curve. It is referred to as a “bell curve” because it simply looks like a bell! A normal distribution curve is special because it is perfectly symmetrical. 

Sample size actually plays a large role in distribution, especially normal distribution. First, let’s give sample size a brief definition: the number of data points or subjects in your sample, typically denoted as “n.” Another way to think about it is simply how much data you have. For example, if you conducted a survey of 50 students in a high school, your sample size would be 50. The larger the sample size, the more closely your sample’s distribution resembles the population’s distribution; relatedly, the Central Limit Theorem posits that the distribution of the sample mean becomes more normal as the sample size grows. Since many variables are naturally distributed normally, the better your sample approximates that distribution, the better it is at approximating population parameters. This is easier to conceptualize if you can actually see it for yourself. Below we have put together a demonstration that shows how increasing the sample size causes the sample’s distribution curve to look more like the (approximately normal) population curve.

For this demonstration, let’s use our example data set and look at the living room size. Let’s assume that the true population size is 21,613. We began by randomly drawing a sample of x number of living room sizes from the population and then graphed the sampled data in the form of a histogram to show the distribution. Note that the x axis represents the living room size in square feet, and the y axis represents the frequency in which it occurs. 

With 100 randomly selected data points, our distribution curve has a wide spread and does not resemble a normal distribution curve at all. This indicates that a sample size of 100 is inadequate.
With 500 randomly selected data points, the distribution curve has less spread and begins to look more normal than the sample of 100. This is a better candidate than the sample of 100, but we can still get a better model by using more data points.
With 1,000 randomly selected data points, the spread decreases once again and the shape becomes approximately more normal. This is a better candidate than a sample of 500, but can become more normal with more data points.
With 10,000 randomly selected data points, this is where we see a normal distribution curve begin to truly emerge. Since the spread is decreasing, we know that the number of outliers is also decreasing.
With 20,000 randomly selected data points, the curve looks about the same as the sample of 10,000, except slightly more normal and with slightly less spread. This makes it a better candidate than 10,000 points, but likely only marginally.
Here is the distribution curve of the “population” of 21,613 data points. This has the smallest spread and the most normal shape. The sample of 20,000 points most closely resembles this curve.
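This demonstration is easy to reproduce. The sketch below uses a synthetic stand-in population (a 2,000 sq ft mean and 500 sq ft standard deviation are invented numbers, not the King County data) and shows the sample summary statistics settling toward the population's as n grows:

```python
import random
import statistics

random.seed(42)

# Synthetic "population" of 21,613 living-room sizes (illustrative only)
population = [random.gauss(2000, 500) for _ in range(21613)]

for n in (100, 500, 1000, 10000, 20000):
    sample = random.sample(population, n)
    # As n grows, a histogram of `sample` looks more and more like the
    # population's curve, and these summaries stabilize near the
    # population mean and standard deviation.
    print(n, round(statistics.mean(sample)), round(statistics.stdev(sample)))
```

Plotting a histogram of each `sample` reproduces the sequence of curves described above.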


As we can see, as our sample size increased, the distribution curve resembled our population curve more accurately. Once we gathered a sample size of 10,000, an approximately normal distribution curve emerged and improved as we increased the sample size to 20,000. This shows why it is so important to have a large enough sample size if we want to draw valid conclusions about the population from our sample. Something to note, however, is that the sample of 10,000 data points produces a curve that looks very similar to the 20,000 and 21,613 curves; although the 20,000 technically best represents the population, we can say that we will have very similar, if not identical, results when cutting that sample size in half to 10,000, since their respective curves are mostly the same. This helps to save both time and resources, since you will have to collect significantly less data.

More broadly, it also shows us why we do not necessarily need to use the full population if we want to accurately answer our questions. Why would I spend all of the time and money collecting 21,613 data points, when I could find the same conclusions with only, say, 10,000 data points? This is exactly why we use samples and not populations for statistical exploration. As long as it is large enough, a sample will accurately represent our target population.

But how do we know exactly what our sample size should be, and what is deemed large enough? Obviously we want as large a sample size as possible, but due to finite time and resources, it may be necessary to simply find the minimum number of subjects required for valid conclusions about the population. Finding this value depends heavily on how accurately you want your results to reflect the population, which will largely be dictated by your margin of error. The smaller your margin of error, the larger your sample size will have to be. After setting a margin of error, you can use the following formulas to calculate the necessary sample size based on the type of method in use:

  • If you are looking at the percentage / proportion of a population that contains a certain characteristic:

      n = (z*)² × p × q / (margin of error)²

    • p = the estimated percent of the population that contains the attribute in question
    • q = 1 − p = the estimated percent of the population that does not contain that attribute
    • z* = z multiplier
      • ex: 95% confidence level → =NORM.S.INV(0.975) = 1.96 (don’t worry, we go through an example of how to do this step later in the section)
  • If you are looking at a regression model, this one is a little bit more flexible and subjective
    • Green (1991) → n >= 104 + k
      • k = number of predictors
    • Rule of thumb → 10 subjects for every one predictor 

Note that if the calculated sample size has decimals, it is standard to round up to the nearest whole number, regardless of the value of the decimal. For example, a calculated sample size of 145.129 would be rounded up to 146, even though traditional rounding rules would indicate not to. Also note that margin of error must be expressed in decimal form when used in the formula. (ex: 5% → 0.05)

These formulas might seem like a foreign language if you’re non-technical, but don’t despair! They’re actually much simpler than they look. You have all of the information, you just need to plug it in and voila! All of the variables that you are plugging in are either numbers you already have or numbers you are determining yourself. If you are still a little bit confused, don’t worry! Before going on to the next section, we will run through an example for you. Let’s set the scene: you are trying to find out what percentage of houses in King County, WA have been renovated, but you are confused about how many houses are necessary to collect data on for accurate results. We will begin by finding each value we need, and then putting it together by plugging the numbers into the formula and finding our sample size. Let’s begin!

  • p → we can estimate that approximately 50% (0.5) of the houses have been renovated; this is just an estimate, and does not need to be accurate
  • q → 1-p → 1-0.5 = 0.5; we can estimate that approximately 50% (0.5) of the houses have not been renovated; once again, just an estimate, so don’t get too caught up in figuring out this value
  • z* → let’s say we decided we want 95% confidence in our results; we would find our z* multiplier by calculating =NORM.S.INV(0.975) in Excel, which gives us 1.96; 95% is a pretty standard confidence level to use
    • Refer to the Excel Guide for an explanation of why we input 0.975 
  • Margin of error → let’s say we decide we want our results to be within 3 percentage points of the true population parameters, then our margin of error would be 3%, but would be put in as 0.03 in the formula.
  • NOTE: When we set our confidence level and margin of error, we are saying: “We want to have x% confidence that our results are within y percentage points of the true population parameter where x represents the confidence level and y represents the margin of error.”
  • n = (1.96² × 0.5 × 0.5) / 0.03² = 1067.11 → round up → 1068 houses 
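The whole calculation fits in a few lines of code. A sketch (the function name is ours, and the z multiplier comes from the standard normal inverse CDF, the same value Excel's =NORM.S.INV(0.975) returns):

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(p, margin_of_error, confidence=0.95):
    """n = (z*)^2 * p * q / (margin of error)^2, rounded UP."""
    q = 1 - p
    # Two-sided multiplier: 0.95 confidence -> inv_cdf(0.975), about 1.96
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * q / margin_of_error ** 2)

print(sample_size_for_proportion(0.5, 0.03))  # the worked example: 1068
```

Note that using p = 0.5 maximizes p × q, so it is the safest estimate when you have no idea what the true proportion is.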


Missed Part 1 of this blog series? Click here.

Ready for Part 3? Click here.