Designing, improving, and creating data warehouses is what we do day in and day out at dbSeer. If you are building a data warehouse or if you simply want to make improvements to an existing one, there are five core principles that you must keep in mind. These are the best practices we’ve learned along the way and still use today.
- Data Quality (the top priority!)
Data quality is essential because if you don’t manage the quality of the data, you will lose the trust of the data consumer. While it might not be possible to address every data quality issue, you should have the ability to quantify the accuracy of the data for the end user. In order to get the quality correct, you need to have a system in place that captures and logs data issues in transit. That’s not all, though, you also need a mechanism to capture data issues at rest (i.e., the stored data), so you can make sure it’s not corrupted. Furthermore, there should be a framework that can automatically clean the data, or, alternatively, you should have a process so that you can do it yourself. I’ll say it one more time: data quality and the ability to measure the cleanliness of the data is critical.
- Traceability
Everything within your framework should be traceable. In order to have a successful production-ready environment, a non-DBA, such as a system administrator, application administrator, or business analyst, needs to be able to identify potential issues in the data warehouse or ETL Processes. There must be a mechanism in place that can trace the data movement in each step (the data size, how many rows, operations/manipulations that are happening on the data, transformations or enrichments, etc.) This information should then be available on an interface for the DBA and non-DBA – the people who aren’t experts. Over time, you will build a history that you can analyze to identify any potential issues.
- Modularity
A data warehouse commonly brings in a variety of data sources and serves the user for many reasons. You need to recognize the context, function, and purpose of each specific data warehouse. While there might be a cross reference between the subject areas, the goal of the database design should be to isolate modules whenever possible. If you design it this way and make sure it is in your ETL processes, the end user has the ability to turn on and off certain modules and configure it to meet their needs. This makes the design more robust.
- Purpose Built for Analytics Consumption
One of the mistakes that we see often in our work is that people give technology a higher priority than analytics needs. This is a mistake – you shouldn’t let the technology drive the requirement; it should always be analytics over engineering. The priority is the end user and the analytics you’re building for that user. When you focus on engineering, the design becomes the main objective, when analytics should always be the first priority. Reporting analytics should dictate the structure, aggregation, and granularity of the data.
- Adaptability
It’s common for the data warehouse to evolve as the requirements change, due to expansion or new business requirements. The data warehouse design should lend itself to that evolution. In order to do this, you should ensure that the modularity allows you to add and remove different parts. It’s always a good idea to build multi-level granularity within your dataset and data warehouse. It should be built to be flexible and allow for changes in the future.
You might be wondering if your data warehouse was designed with these principles in mind. Or, you’re building a new data warehouse and want to start off using these best practices. dbSeer is able to conduct an audit of your data warehouse against these five core principles. We will evaluate your data warehouse against our 70-question checklist and provide you with a report showing you where the gaps are and how you can make improvements. Reach out if this sounds like something you’d be interested in!