Start Modeling Data

Data modeling sounds really scary, like a big chore and months of work.

But it is not so bad, and you can get started in less than 10 minutes.

For this example we will use BigQuery and dbt. BigQuery is one of Google's cloud database offerings. dbt, which stands for data build tool, is a data modeling tool created by Fishtown Analytics.


https://cloud.google.com/bigquery/

BigQuery comes with a set of public data sets that are great for practicing data modeling. I will be using the Stack Overflow data set.

You can start using Google Cloud's various services for free, but you will need to upgrade your billing so that you can connect dbt to Google Cloud. If you have not signed up for Google Cloud Platform services before, they will give you a $300 credit (which is more than enough to run this test thousands of times), so don't worry about the cost of trying this out.

Installing dbt


To install dbt you can visit their documentation page here:

https://docs.getdbt.com/docs/macos

Or you can follow along below; most of what is here is straight from their docs anyway.

I suggest using Homebrew to install dbt; it makes the process extremely easy. If you do not have Homebrew, open the terminal on your Mac and run the following command.

CODE:https://gist.github.com/fronofro/1674f3b95c5416baa908ce4bd2a11abe.js
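In case the gist embed above doesn't render, the Homebrew install one-liner (taken from brew.sh; check that site for the current version) looks like this:

```shell
# Install Homebrew (command from https://brew.sh)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```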

After it installs, run the following commands in the terminal.

CODE:https://gist.github.com/fronofro/ff055336ed408bdd30cd0f837f102cdf.js
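If that embed doesn't render either: at the time of writing, dbt was installed through the Fishtown Analytics Homebrew tap, roughly like this (the tap name may have changed since, so check dbt's install docs):

```shell
# Add the Fishtown Analytics tap and install dbt
brew update
brew tap fishtown-analytics/dbt
brew install dbt
```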

You will also need git:

CODE:https://gist.github.com/fronofro/59783e7429bc86b1c85d28a9072f969a.js
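The git install is a one-liner with Homebrew:

```shell
# Install git via Homebrew
brew install git
```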

Create a folder on your computer (I named mine dbt Projects). We are going to populate it with all the files and subfolders dbt needs to get started. Start by navigating into that folder in the terminal.

Then we run the following inside of terminal.

CODE:https://gist.github.com/fronofro/8923a1265c759cc4add3270478bf9080.js

I called mine BQSO, so my terminal command looked like this:

CODE:https://gist.github.com/fronofro/bcaa57aa88943fcc51fc26aaa45b0fd1.js
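If the embeds above don't load: the command is dbt init followed by your project name, so for my BQSO project it was:

```shell
# Scaffold a new dbt project named BQSO in the current folder
dbt init BQSO
```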

Navigate inside the folder to see all the folders and files dbt created for us:

CODE:https://gist.github.com/fronofro/94724f469a817c360f50963a2399ef8d.js
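In case that embed doesn't render, it is just a cd and an ls (the exact scaffolding dbt generates may differ by version):

```shell
# Move into the new project and list what dbt scaffolded
cd BQSO
ls   # you should see dbt_project.yml, a models/ folder, and a few others
```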

Configuring BigQuery

To get dbt to work with BigQuery we need to give it permission. We do this by setting up a profile (think of it as an account) with login information. Basically, you create a profile in dbt's folder and then link that profile to the specific dbt project you just created.

Go to dbt's profiles file (a sample profiles.yml was created when we ran the dbt init command):

CODE:https://gist.github.com/fronofro/fbcd2e9e84db07740aa0dd34aacc2df9.js
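If the embed doesn't load: on a Mac the profiles file lives in the hidden ~/.dbt folder, so you can open it with something like:

```shell
# Open the dbt profiles file in your default text editor
open ~/.dbt/profiles.yml
```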

This will pop open a file called profiles.yml. Configuring this file is the most challenging part of this tutorial. As a starting point, you can copy and paste the code below to replace what is in the file.

CODE:https://gist.github.com/fronofro/5f66dcc32f848ad32f5314ab67745318.js
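For reference in case the embed doesn't render, a profiles.yml for this setup looks roughly like the following. The values shown are the ones from this walkthrough (and the target/outputs nesting is the standard dbt profile shape); yours will differ:

```yaml
my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: healthy-terrain-239904            # your BigQuery project name
      dataset: soCleaned                         # the schema (dataset) to build models in
      keyfile: /users/matt/BigQuerykeyfile.json  # path to your service account key
      threads: 1
```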

Now I will walk through each field, marking where you will need to update it with your own info.

Name of profile   -   my-bigquery-db:

  • This is the name that will be used to link the profile (account details/login info) to the project
  • I think this name makes sense but feel free to change it to whatever name you would like

Type of db   -   type: bigquery

  • No surprises here

How you will connect - method: service-account

  • This is specific to the database chosen; for BigQuery, this is how you do it.

Name of project from BigQuery - project: healthy-terrain-239904

  • This is the project name; it will be a weirdly named thing inside of BigQuery on the left.
  • Replace healthy-terrain-239904 with your project name.
  • You may need to create a new project in BigQuery, but there should be a default one, which is fine to use for this example

Name of the schema (schemas are called datasets in BigQuery) where you will put the modeled data - dataset: soCleaned

  • Inside of BigQuery
  • Click on your project (healthy-terrain-239904)
  • On the right you will see Create Data Set, click that
  • The Dataset ID will be the name of the schema
  • Replace soCleaned with your schema name that you put in the Dataset ID

Your private key that you need to get from BigQuery - keyfile: /users/matt/BigQuerykeyfile.json

  • You do this by going to IAM & admin in the Google Cloud console (hidden in the hamburger menu on the left)
  • Click Service Accounts
  • Click Create Service Account
  • Give it a name (something like dbt)
  • Select the Role - BigQuery Admin
  • Create a key - JSON
  • This will download the key to your computer
  • You will need to put the file path to that key in the yml file, so place it somewhere that makes sense to you

Once you update all of those fields in your dbt profile (profiles.yml), you need to link that profile to the project we created.

Go to the project folder we created earlier (BQSO in my case) and open the yml file inside of it:

dbt_project.yml

Now you only need to update one thing in this file: you need to set the profile to the name we just created:

CODE:https://gist.github.com/fronofro/70ce8f56169676cfc89cf6bacf55c991.js

  • This is the link to the profile we just created, so if you changed that name to something else, replace 'my-bigquery-db' with whatever you created. It does need the single quotes around the name of the profile.
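If the embed above doesn't render, the relevant line in dbt_project.yml looks like this (assuming you kept the profile name from above):

```yaml
# in dbt_project.yml
profile: 'my-bigquery-db'
```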

Creating a New Table with Modeled Data

Go to the models folder in your project and create a new .sql file. In that .sql file you can write a SQL statement whose output will be the modeled data. Try adding this text to the .sql file and save it:

CODE:https://gist.github.com/fronofro/e7eb68a97a2554963b25ae59e1cb3aba.js
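If the embed doesn't load, a model that would produce the result described below (the single most viewed Stack Overflow post) could look like this. The table and column names here are my assumption about what the gist contains, based on BigQuery's public Stack Overflow data set:

```sql
-- firstModel.sql: grab the single most viewed post
-- from the public Stack Overflow data set
select *
from `bigquery-public-data.stackoverflow.posts_questions`
order by view_count desc
limit 1
```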

Whatever you named the .sql file will be the name of the table in the schema (dataset). In my case I saved it as firstModel.sql.

Running your Models

Go to the terminal, make sure you are in the project folder of the dbt project, and type:

CODE:https://gist.github.com/fronofro/e058614403d5bdb3368470b26f303988.js
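In case the embed doesn't render, the command is just:

```shell
# Compile and run all models in the project against BigQuery
dbt run
```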

Boom, refresh BigQuery and see the new table. You can query it with a simple select.

CODE:https://gist.github.com/fronofro/cffe9fbcfce5fc1e6bbc17a1076c24d5.js
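If the embed doesn't load, the query is a plain select against the new table (using the dataset and model names from this walkthrough):

```sql
select * from soCleaned.firstModel
```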

Can you believe the most viewed post is about git? Classic.

You have now modeled data and queried modeled data. Not so scary right?

Why did we model this in the first place?

Well, querying this modeled data took 0.2 seconds and processed 473 bytes (granted, this is just a single row with 20 columns).

When I ran the same query on the full Stack Overflow data set, it took 20.3 seconds and processed 26.64 GB.

There are many other things we can do with modeling and dbt, such as cleaning up data, simplifying the schema, or finding other ways to get more performance out of a query. Stay tuned for more!
