Money ball - Sabermetrics, you too can analyze data…

I wanted to write a blog to demonstrate the power of integrating dbt within Fivetran to load data into Snowflake. So, I thought about what data is freely available and decided that rather than one more analysis of pandemic data I’d prefer to focus on the upcoming baseball season.

site-image
Fig 1: Our Pipeline Software Stack


Assumptions for running this tutorial

This post demonstrates how to create a data pipeline for using Fivetran, Snowflake, dbt and git. As such, you will need have the following to do this tutorial:

  1. A Snowflake account to use as a target data connection.
    You can get a free trial here: Snowflake Signup

  2. A Fivetran account.
    You can get a 14 day trial here: Fivetran Signup

  3. You will need dbt core installed locally.
    Here are instructions for setting it up: dbt setup

  4. You will need a git account of some sort.
    You can sign up for a free github account: Github Signup

Play Ball

Baseball data background… “blah blah blog”

A quick Google search on free data helped me find this site: Baseball Archive Statistics
Where I downloaded the baseball databank from here: Baseball Databank

And extracted into separate google sheets:

site-image
Fig 2: Data stored in Google Sheets - the folders

site-image
Fig 3: Data stored in Google Sheets - the files

Sample of batting statistics:

site-image
Fig 4: Baseball Data

And the data contains definitions in this readme.

readme:2.9  BattingPost table

yearID         Year
round          Level of playoffs 
playerID       Player ID code
teamID         Team
lgID           League
G              Games
AB             At Bats
R              Runs
H              Hits
2B             Doubles
3B             Triples
HR             Homeruns
RBI            Runs Batted In
SB             Stolen Bases
CS             Caught stealing
BB             Base on Balls
SO             Strikeouts
IBB            Intentional walks
HBP            Hit by pitch
SH             Sacrifices
SF             Sacrifice flies
GIDP           Grounded into double plays

Batter up

We are going to create a table in Fivetran using the BattingPost google sheet loaded previously.

Adding your first connector (this step depends on having setup the Fivetran User with the appropriate privileges in Snowflake).

site-image

In Select your data source Search for Google and select Google Sheets. Click Continue Setup:

site-image

On the Google Sheets page enter the following:

Destination schema: `baseball_data`
Destination table: `batting_post`
Authentication Method: choose Authorize Service Account
Sheet URL: https://docs.google.com/spreadsheets/d/1GYIaA0CXFx88wH8CgwMQcRYCN8YxiX_-y8YTeN-VUg4/edit#gid=2089050719

You may have to scroll down to see the Find Sheet button. Click on it and you should find DATA in the Named Range drop down.

site-image

Then click 'Save and Test':
site-image

You should see this if everything is successful:

site-image

Click Select your data to continue.

You should get a warning for starting to do the initial sync. As this is public data it is not a concern. But be mindful of your company’s data governance policies for your production environment. Click start initial sync:

site-image

This launched the first sync and by default set the next one to 6 hours:

site-image

In Snowflake, we can see that the schema BASEBALL_DATA was created along with the BATTING_POST table and FIVETRAN_AUDIT table.

site-image

We can preview the data and see that it loaded as expected:
site-image

Those column names are ugly though. We could probably make them look better with human readable names.

We can transform the data within Fivetran. And, we have choices. We can use Fivetran’s Transformations or we can use dbt from within Fivetran. First, from within Fivetran on the left hand navigation choose Transformations:

site-image

Then select Try dbt Transformations (you will authorize your github repo to be used by Fivetran)

site-image

Click to Enable dbt Transformations then you will be presented with this requirement to proceed:

site-image

I hadn’t setup a repo previously, so I chose Help me set up my dbt project.

This brings you to this page which is pretty self explanatory: Transformations for dbt Core Setup Guide
The notes I’ll add are here:

  1. From the prerequisites you’ll need to install dbt.

  2. I’ve found this helpful for setting the dbt grants in Snowflake.

  3. You can skip Step 1 and begin with Step 2 - Create a new repository.

    **I have a github account and am going to add a repo to it.

  4. Step 3 doesn’t mention adding the Snowflake database info. But, you’ll need to know info to setup your profile:

site-image

When I completed the Transformations for dbt Core Setup Guide I got the following error:

site-image

dbt project sync has failed deployment.yml file not found. Make sure that the deployment.yml and dbt_project.yml files are present in the same folder.

I didn’t see in setup or pre-requisites an example of adding the deployments.yml file. Here is a link to a sample from Fivetran.

I chose to create a job that runs every 5 minutes while developing and unit testing and then change the schedule after that to less frequent:

site-image

With this being setup dbt has a job that will rebuild the existing models that are setup in the project:

site-image

But, we haven't started using dbt to work on the baseball data.

We’ll start transforming our BATTING_POST data.

Letting dbeaver generate and format a select statement gives a query that is not nice to look at:

	SELECT
		"_ROW",
		BB,
		AB,
		LG_ID,
		G,
		H,
		IBB,
		HR,
		YEAR_ID,
		"_3_B",
		SB,
		GIDP,
		"_2_B",
		CS,
		R,
		SF,
		ROUND,
		SH,
		TEAM_ID,
		HBP,
		RBI,
		SO,
		PLAYER_ID,
		"_FIVETRAN_SYNCED"
	FROM
		FIVETRAN_DATABASE.BASEBALL_DATA.BATTING_POST;

This should only persist in a raw area. Let’s create a model to clean it up.

Using the data dictionary we can produce a better looking query to use in our staging model.

site-image

And along with the new sources file src_baseball_data.yml.

site-image

We should be ready to use dbt compile to execute our code.

site-image

Dbt run completes successfully:

site-image

And in Snowflake this shows us that we have human readable data. But, we haven’t shown that Fivetran added the data. With successful runs of the same data, I added time to the job. Now, I want to make sure that Fivetran is executing the run and not my command line. So, I drop the view.

site-image

And update the schedule to run on the 5's again.
It completes successfully:
site-image

And the view has been rebuilt by dbt.

site-image

Summary

In this tutorial we’ve just begun to see how Fivetran can be integrated to use dbt within its pipelines.

Evolv consulting has experts in each of the technologies discussed here as well as many others to assist you in building out your modern Cloud Data Solutions. Contact us today for assistance.