From the previous chapters, most of the ambiguity about what is going onto the dashboard should have been addressed.
However, the biggest challenge can be the “How,” as in how do we get the data to build the actual dashboard. The three data scenarios you will encounter are:
You have the data
The data is messy
The data does not exist
If you have the data, this will be an easy step of the process. If the data is messy or does not exist, things will get much more challenging.
Now you have to find the data that you will use to calculate those metrics in your database. Review the schemas of the databases you have access to.
Finding the data you need can be the hardest part of the process. Oftentimes data is not well documented, and what you need could be spread across multiple databases. First, search for tables that have keywords from your Metrics Spreadsheet.
When you find something promising, such as a table whose name matches the keyword you searched or that contains a field matching your keyword, do a quick query:
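Since the real connection and table details will vary, here is a minimal sketch of that spot-check in Python with SQLite, using a hypothetical expenses table:

```python
import sqlite3

# Hypothetical stand-in for your real database and table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (department TEXT, amount REAL)")
conn.executemany("INSERT INTO expenses VALUES (?, ?)",
                 [("Marketing", 1200.0), ("Engineering", 5400.0)])

# The quick query: pull a handful of rows to eyeball the columns.
rows = conn.execute("SELECT * FROM expenses LIMIT 10").fetchall()
for row in rows:
    print(row)
```

A LIMIT keeps the query cheap even on large tables; you only need enough rows to judge whether the columns hold what you think they hold.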
Once you get the results you can quickly see if it has the relevant information. In this case we can see it has a department column that looks appropriate, and the amount column may be the cost data we are looking for.
If a table such as this one looks relevant to one of the metrics, write it down or put it directly in the Metrics Spreadsheet.
Your next step will be to determine what data you cannot find yourself. If you find a table but are unsure whether it holds the correct data, put the name of the table in with a ? at the end. If you cannot find any relevant tables for a metric, put ??? in the Metrics Spreadsheet.
If you have any questions about the tables or columns you have found, it is time to consult the Data Gatekeeper. I’d highly recommend coming to the Data Gatekeeper with the list of tables and fields you have questions about; the Metrics Spreadsheet is a convenient way to structure the conversation.
Go through each metric with the Data Gatekeeper, and explain what tables and fields you think you should be using and which ones you have questions about.
The Data Gatekeeper will confirm which tables and fields you have selected so far or will help locate the tables and fields you need. Some of the data you need might not be accessible to you due to access permissions. The Data Gatekeeper may grant you access or will provide feedback about how to work around this limitation. After locating the relevant tables, update the Metrics Spreadsheet.
We also need to specify which fields within the tables will be used, to make creating the SQL queries easy. Place the field names from the tables you found directly into the Formulas, and put them in parentheses below your grouping categories in the Content column. Notice that ‘Total’ does not have a column associated with it because it is the full aggregation.
Well, you have data, but it is messy. The problem may be obvious, such as missing values. It may be mysterious, such as a value in the database that differs from what a user of your application is seeing. Let’s explore what to do about different messy data problems.
On any column that you will be using in a metric calculation, you should check for NULLs and blank records. Here is an example query to get the total number of NULLs and NULLs as a percent of total records.
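The exact query depends on your schema; a sketch of the NULL check, assuming a hypothetical expenses table with an amount column:

```python
import sqlite3

# Hypothetical demo table where some amounts are NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (department TEXT, amount REAL)")
conn.executemany("INSERT INTO expenses VALUES (?, ?)",
                 [("Marketing", 1200.0), ("Engineering", None),
                  ("Sales", 800.0), ("Sales", None)])

# COUNT(amount) skips NULLs, so its difference from COUNT(*) is the NULL count.
nulls, pct = conn.execute("""
    SELECT COUNT(*) - COUNT(amount),
           100.0 * (COUNT(*) - COUNT(amount)) / COUNT(*)
    FROM expenses
""").fetchone()
print(nulls, pct)
```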
To check for blank values we can use:
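A sketch of the blank-value check. Blanks are empty or whitespace-only strings rather than NULLs, so they need a separate test; table and column names here are hypothetical:

```python
import sqlite3

# Hypothetical demo table with blank department values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (department TEXT, amount REAL)")
conn.executemany("INSERT INTO expenses VALUES (?, ?)",
                 [("Marketing", 1200.0), ("", 300.0), ("  ", 450.0)])

# TRIM collapses whitespace-only strings to '', catching both kinds of blank.
blanks = conn.execute("""
    SELECT COUNT(*) FROM expenses
    WHERE TRIM(department) = ''
""").fetchone()[0]
print(blanks)
```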
You need to evaluate how to treat missing values. There are three common options:
Ignore - leave the values as they are
Delete - remove any records with missing data
Impute - replace the missing value with a value
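For instance, the impute option can be applied at query time with COALESCE. The zero default below is a hypothetical choice, and as noted next, whichever value you pick should itself be documented:

```python
import sqlite3

# Hypothetical demo table with a NULL amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (department TEXT, amount REAL)")
conn.executemany("INSERT INTO expenses VALUES (?, ?)",
                 [("Marketing", 1200.0), ("Engineering", None)])

# Impute at query time: COALESCE replaces NULL with a documented default (0 here).
total = conn.execute(
    "SELECT SUM(COALESCE(amount, 0)) FROM expenses"
).fetchone()[0]
print(total)
```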
Regardless of which option you choose, be sure to document the decision so others can reproduce the calculation and understand whether they should treat the data they are seeing as 100% accurate.
On any column that you will be using in a metric calculation you should also check for bizarre values. The Data School recommends doing a quick check on the highest and lowest values of any of the fields that will be used. You can do this using the ORDER BY clause.
This will quickly surface values that are way off if they are in the field.
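A sketch of that ORDER BY check against a hypothetical expenses table, where one record carries an obviously bad value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (department TEXT, amount REAL)")
conn.executemany("INSERT INTO expenses VALUES (?, ?)",
                 [("Marketing", 1200.0), ("Engineering", 5400.0),
                  ("Sales", -999999.0)])  # an obviously bad value

# Sort both ways and inspect the extremes at each end.
lowest = conn.execute(
    "SELECT department, amount FROM expenses ORDER BY amount ASC LIMIT 5"
).fetchall()
highest = conn.execute(
    "SELECT department, amount FROM expenses ORDER BY amount DESC LIMIT 5"
).fetchall()
print(lowest[0], highest[0])
```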
In text fields this is more difficult to detect; however, there are some tricks here as well. They tend to be use-case-specific tips.
While the previous method showed us obvious outliers, there can be more subtle outliers that you may want to address. A commonly accepted definition of an outlier is any value above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR. In other words, one and a half times the interquartile range is added to the upper quartile, or subtracted from the lower quartile, to set the thresholds beyond which values are considered outliers.
Quartiles split a quantitative variable up into 4 sections
The interquartile range is the difference between the upper quartile (Q3) and the lower quartile (Q1).
Here is an example query applying this formula to find outliers using IQR.
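The exact query depends on your SQL dialect. As a sketch, SQLite has no built-in percentile function, so the quartiles below are computed in Python and plugged into the filter; table and column names are hypothetical:

```python
import sqlite3
from statistics import quantiles

# Hypothetical demo table with one extreme amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (amount REAL)")
conn.executemany("INSERT INTO expenses VALUES (?)",
                 [(1,), (2,), (3,), (4,), (100,)])

# Compute Q1 and Q3 in Python, then filter with the 1.5 * IQR fences.
amounts = [r[0] for r in conn.execute("SELECT amount FROM expenses")]
q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1

outliers = conn.execute(
    "SELECT amount FROM expenses WHERE amount > ? OR amount < ?",
    (q3 + 1.5 * iqr, q1 - 1.5 * iqr),
).fetchall()
print(outliers)
```

On databases with PERCENTILE_CONT or similar, the quartiles can be computed in the query itself instead.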
Again, in text fields this is more difficult to detect. Watch out for the following:
Misspellings
Capitalization inconsistencies
SQL will treat each of these variations as a unique value. You can find them by grouping by the column the text values are in and reviewing any groups with very few records in them. A tiny group is potentially a sign that those values should have been part of a larger group but weren’t, due to misspellings and capitalization inconsistencies.
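That grouping trick can be sketched as follows, using a hypothetical department column where a lowercase variant and a misspelling surface as tiny groups:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (department TEXT)")
conn.executemany("INSERT INTO expenses VALUES (?)",
                 [("Marketing",)] * 50 + [("marketing",), ("Marketting",)])

# Groups with very few records often hide typos or casing variants.
suspects = conn.execute("""
    SELECT department, COUNT(*) AS n
    FROM expenses
    GROUP BY department
    HAVING COUNT(*) < 5
    ORDER BY n
""").fetchall()
print(suspects)
```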
Sometimes you do not have the data in your database to calculate a metric. This can be seen as a huge roadblock, but there are a few ways forward. We first need to ask ourselves the following questions to know what we can do.
Do we care about historical data?
Is the metric actually trackable?
How much data is needed for this metric to be valuable?
If historical data is of no concern, consider working with the Data Gatekeeper to talk with engineering about starting to track what you want. This will delay the project’s completion, depending on the engineering resources available to help.
Consider the costs
If the metric desired here is for a statistical test, talk with the Data Gatekeeper about how much data will be necessary to run the test properly. Factor in how much time it will take to get that data and let the Point Person know. Oftentimes people will want to draw conclusions as quickly as possible; do not let your dashboard be misused this way. Set expectations up front when instrumenting new data points, and perhaps even note on the dashboard when conclusions can be drawn.
If we want historical data but we do not have the data for the metric we wanted we can use proxy metrics. Proxy Metrics are metrics that give us the same or similar information to what we wanted.
The Point Person wants to know what customers think of the product.
Desired Metric: NPS
Proxy Metric: Return Visitors
While Proxy Metrics are not exact, they can give us a good estimate. That said, proxy metrics can be dangerous because they can change the focus of your company.
Consider a company that has to wait a long time to see whether its product was valuable to the consumer, such as an education company. It provides a lot of knowledge, but if its ultimate goal is job placement, it has to wait several months to know whether the education was effective for the student.
How can they measure if they are doing a good job before their desired metric event happens?
You could introduce surveys to measure how confident students are in their job prospects, you could measure graduation rates, or you could measure the amount of inbound interest in your students from outside employers. Each of these Proxy Metrics, which you might be able to assess sooner than the “did they get a job” metric, will lead you to focus on different activities. So be cautious with Proxy Metrics.