When multiple people ask the same question and use data to get answers that aren’t the same, it creates doubt in all of the data in your organization. This is demoralizing for everyone and time consuming to figure out who if either was actually right. Unfortunately, this is typical when data has not been cleaned up into a Single Source of Truth.
People get inconsistent results because:
- Data sources change
- Schemas are overly complex
- Table and column names are confusing
- Undefined Metrics in the data
What is a SSoT Architecture?
A Single Source of Truth is a database where the data is accurate and is used by everyone in a company when querying for data.
The promise of a Single Source of Truth is accuracy across your organization.
This is an obvious thing that a company want but a lot of companies struggle to ever deliver this. Creating a Single Source of Truth requires data engineering effort. Let’s explore the problems that SSoT solves, where it often goes wrong, and best practices:
Companies buy and use services and tools that generate data to conduct their business. The set of services and tools providing data (data sources) changes as companies grow or use cases shift. Before you have built a single source of truth, your company will likely have data sources that overlap in terms of what they track. You will also have data from data sources in your data lake that the company don’t use anymore, but the data is still needed for certain analyses.
Imagine you were tracking sign-ups via Hubspot and then after a year started using Salesforce to track sign-ups. The Salesforce data will be empty before the date you started using it and the google analytics data might not be as well maintained going forward. To an analyst, when they go to query for sign ups it will be unclear which data source they should use.
Consolidate Data Sources
When your company has used multiple tools to track the same type of data, if you can, migrate the data from the previous tools into the latest tool. If this is not an option, use the data warehouse to create a table which unions the data from both sources so that the historical record is not lost and to have one place to go to for the relevant metrics. This will require some renaming and cleaning to accomplish.
In addition, if you want to maintain access to old/unused data sources from your Data Lake in your Data Warehouse, you can label data sources as deprecated or approved to help guide people during their analysis.
In a Data Lake the schema reflects in app logic and follows best practices such a 3rd normal form so that updating values will not produce errors. This type of schema can be difficult to navigate and many tables will never be used in an analysis. In the past books recommended using dimensional modeling to reduce the schema complexity to make it easier to query and more performant. Due to advances in BI tools such as Chartio and Data Warehouse technologies dimensional modeling is no longer worth the effort.
We create a single source of truth by creating views on top of the existing schema. There is no need to move away from 3rd normal form. The main thing we want to do to simplify the schema is to not include tables in the new views that only contain app specific logic and are not useful for analysis. However if you do want to make it even easier to work with a specific set of data you can create a wide table (view) that does all the joins. This can sit alongside the cleaned up normalized version of the data warehouse.
Simplify Tables and Columns
Table and column names are typically created by engineers to be used by the application the data is from. Table names, column names, and even a columns whole purpose for being in the table can be confusing to whoever did not write the code. This makes it challenging for business users to analyze the data without consulting with the engineer.
Naming convention and style guide
When going through and recreating the schema with views of the relevant tables you should also clean up what is in each table. Do not include irrelevant columns and rename any columns that are confusing. Create and follow naming conventions to help people analyze data consistently because it clarifies what every column is and how it can be used.
Here are some naming conventions we follow at Chartio:
Publish this style guide and distribute it among all of your employees, adoption of known terms becomes easier and easier.
There are a lot of different ways to measure how a business is performing. Some are fairly well known such as Monthly Active Users or Number of Trials started. In most businesses getting an accurate count on a metric is difficult because you need to filter out irrelevant data:
- Test accounts
- Internal Company emails
- Non product-related page visits
- Users that are no longer employed by a client company
Not filtering out the right data will affect your analysis negatively. If presented to others and they have a conflicting analysis it will cause everyone to lose trust in the data.
Another more subtle problem with metrics are their abbreviations. If Monthly Active Users is abbreviated as MAU in the database, it may be misinterpreted in someone else’s analysis. Do not assume everyone understands the abbreviation for the metrics you are reporting.
Create Standard Metrics
To define the calculation of a metric, create a Dashboard with this metric in it and provide text on the dashboard to explain how it was calculated and what has been filtered out. Make this easily searchable!
Another approach is to pre-calculate the metric in a view that is part of the Single Source of Truth database. We recommend doing this through a SQL based modeling tool such as dbt or Dataform. Defining the metric in the database will remove most, if not all, of the confusion.
To eliminate any remaining confusion on using the metric in your analysis, many SQL based modeling tools can add a data dictionary to the data model. This allows the author of the data model to write out comments on why it was calculated that way and why certain data was filtered out.
Putting the metric in the database through modeling allows you to control changes in the data and the definitions systematically. You still will need to communicate out to users of that table of the changes but it will be documented for them if they check on their own.
- Create a Single Source of Truth and give employees access to it and only it
- Make your data intuitive through naming conventions and style guides.
- Simplify Schema by excluding app specific logic tables
- Simplify table and column names by getting column and table names out of the language of technical jargon and define them by their spoken language titles.
- Centralize the control of the metric calculations through SQL based modeling so those metrics are accurate.