Data modeling sounds really scary, like a big chore and months of work.
But it is not so bad, and you can get started in less than 10 minutes.
For this example we use BigQuery and dbt. BigQuery is one of Google's cloud database offerings. dbt, which stands for Data Build Tool, is a data modeling tool created by Fishtown Analytics.
BigQuery comes with a set of public datasets that are great for practicing data modeling on. I will be using their Stack Overflow dataset.
You can start using Google Cloud's various services for free, but you will need to upgrade the billing so that you can connect dbt to Google Cloud. If you have not signed up for Google Cloud Platform before, they will give you a $300 credit (more than enough to run this test thousands of times), so don't worry about the cost of trying this out.
To install dbt you can visit their documentation page here:
Or you can follow along below; most of what is here is straight from their docs anyway.
I suggest using Homebrew to install dbt; it makes it extremely easy. If you do not have Homebrew, open the terminal on your Mac and put in the following command.
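Homebrew's install one-liner, copied from brew.sh (check there for the current version of the command):

```shell
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```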
After it installs, put the following commands into the terminal.
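When this post was written, dbt's Homebrew formula lived under the Fishtown Analytics tap; newer releases have since moved to a dbt-labs tap, so check dbt's install docs if the tap below is not found:

```shell
brew update
brew tap fishtown-analytics/dbt
brew install dbt
```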
You will also need git:
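If git is missing, Homebrew can install that too:

```shell
brew install git
```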
Create a folder on your computer (I named mine dbt Projects). We are going to populate it with all the files and subfolders dbt needs to get started. We do this by navigating into that folder through the terminal.
Then we run the dbt init command, followed by a project name, inside of the terminal.
I called mine BQSO, so my terminal command looked like this:
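With BQSO as the project name:

```shell
dbt init BQSO
```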
Navigate inside the folder to see all the folders and files dbt created for us.
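For example:

```shell
cd BQSO
ls
# you should see dbt_project.yml plus subfolders such as models
# (the exact scaffold varies a bit by dbt version)
```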
To get dbt to work with BigQuery we need to give it permission. The way to do this is by setting up a profile (think account) with login information. Basically, you create a profile in dbt's folder and then link that profile to the specific dbt project you just created.
Go to dbt's profiles file (a sample profiles.yml file was created when we ran the dbt init command).
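By default dbt keeps this file in a .dbt folder in your home directory, so on a Mac you can open it with:

```shell
open ~/.dbt/profiles.yml
```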
This will pop open a file called profiles.yml. Configuring this file is the most challenging part of this tutorial. As a starter, you can copy and paste the code below to replace what is in the file.
I will mark the fields you need to update with your own info in bold.
Name of profile - my-bigquery-db:
Type of db - type: bigquery
How you will connect - method: service-account
Name of project from BigQuery - project: healthy-terrain-239904
Name of Schema (Schema are called datasets in BigQuery) you will be putting the modeled data in - dataset: soCleaned
Your private key that you need to get from BigQuery - keyfile: /users/matt/BigQuerykeyfile.json
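Put together, those fields produce a profiles.yml along these lines (a sketch: the target and threads keys are standard dbt profile settings that I have filled in with typical defaults):

```yaml
my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: healthy-terrain-239904
      dataset: soCleaned
      threads: 1
      keyfile: /users/matt/BigQuerykeyfile.json
```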
Once you update all of those fields in your dbt profile (profiles.yml), you need to link that profile to the project we created.
Go to the project folder we created earlier (BQSO in my case) and open the dbt_project.yml file inside of it.
Now you only need to update one thing in this file: set the profile to the name we just created:
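Assuming the profile name from earlier, the relevant line in dbt_project.yml becomes:

```yaml
profile: 'my-bigquery-db'
```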
Go to the models folder in your project and create a new .sql file. In that .sql file you can write a SQL statement whose output will be the modeled data. Try adding this text to the .sql file and save it:
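Here is a sketch of a model that matches the result later in this post, pulling the single most viewed question from BigQuery's public Stack Overflow dataset (the table and column names come from that public dataset; double-check them in the BigQuery console):

```sql
select *
from `bigquery-public-data.stackoverflow.posts_questions`
order by view_count desc
limit 1
```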
Whatever you named the .sql file will be the name of the table in the schema (dataset). In my case I saved it as firstModel.sql.
Go to the terminal, make sure you are in the project folder of the dbt project, and type:
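```shell
dbt run
```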
Boom, refresh BigQuery and see the new table. You can query it with a simple select statement.
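With the dataset and model names used above, that query would be:

```sql
select * from soCleaned.firstModel
```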
Can you believe the most viewed post is about git? Classic.
You have now modeled data and queried modeled data. Not so scary right?
Well, querying this modeled data took 0.2 seconds and processed 473 bytes (granted, this is just a single row with 20 columns).
When I ran this query on the full Stack Overflow dataset, it took 20.3 seconds and processed 26.64 GB.
There are many other things we can do with modeling and dbt, such as cleaning up data, simplifying the schema, or finding other ways to get more performance out of a query. Stay tuned for more!