Friday, November 17, 2017

My first BigQuery DWH

BigQuery is Google's Analytics Database
Some notes about a project that has taken up lots of time recently. It was a media attribution dashboard for a client running several hundred campaigns. A POC version of the project had been created manually using spread sheets and we had to provide a drop in a replacement ASAP . I took up the task of migrating a spreadsheet based BI to a more robust script and SQL based platform able to handle the rapidly aggregating data which would soon overpower the spreadsheet's models. A secondary challenge was that the entire system were analyzing was under development and would change daily. Despite it's lacks as a classical database (missing triggers and in schema protections) I choose BigQuery for its scale-ability and ease of integration. Despite its limitations it soon felt like a perfect fit for this type of project.

Data collection

  • Data is currently acquired daily via API from various platforms: for example Google AdWords and Google Analytics.
  • I got an initial jump start by modifying a sample AdWords script from Google's developer pages which demonstrated Big Query writing capabilities. Although this system is fully hosted by google it was a decision I would ultimately regret but more on that later.
  • I added support for additional Goggle API queries. 
  • Next I had to handle challenges arising from a  mismatch between field names and foreign key formats between data-sets. So I the initial script became an ETL  with new capabilities in each iteration. Fixing dates, converting string to numbers and handling different formats of null values.
  • Finally I added cleaning up date to the script to fix these.

Stored Procedures

The DWH based on a the classic star schema retained pretty much the same schema and physical layout of tables and views. However there were a number of revisions to the stored procedures controlling the views as I found that the incoming data lost key data or change unexpectedly. The number of abstract entities multiplied due to a new requirement to be backwards compatible with the media campaigns going back many months. Handling changing data as the ETL evolved and rebuilding the BigQuery became unwieldy.
  • To increase the project's agility I added to my script a key capability - to recreate all the stored procedures pragmatically if missing at every run of the project. This allowed me to drop a data set and rebuild everything from scratch as the ETL evolved despite BigQuery's clunky web interface which had rendered progress unwieldy. 
  • To allow creation of the BI front end in parallel to development I bifurcated the project to run   production testing and development versions concurrently each on an independent BigQuery data set
  • BigQuery's standard SQL requires stored procedures to to access tables in as fully qualified up to the data set level so to support multiple version I modified the ETL to rewrite the queries to conform with the current version of the specific data set it should work on.
  • As mentioned BigQuery's legacy queries was inferior to working standard SQL queries. 
While adding these capabilities took a couple of days it significantly increased agility and over the project lifetime was a very significant time saver as it became possible to store very many queries in code and only run the latest versions.

However keeping all the SQL in one large file increased the size of the code base and AdWords scripts does not have minimal editing abilities so It I regretted not placing all the SQL queries in a project which handled multi-line strings would have reduced the adjustments needed to the queries. Due to the project's time frame I had to keep plugging based on the initial decisions.

Testing and QA

Testing and QA became increasingly important as I strove to improve the quality of the project's final output table which detailed attribution of aggregated campaign results to each campaign and products spending could be consumed by any one of several BI platforms (I used Google's Data Studio but tested Tableau and Click Sense as front ends as well). By creating a QA dashboards outlining the data in each table and view of the DWH  I was able to quickly spot leaks and duplicated or missing data.
Data Studio is still in beta and undergoing significant development and several times I saw the QA environment collapse due to changes which I could not control. Ultimately I decided to split QA into a set of sanity test run in code and a simpler set of Dashboards to allow inspection of the issues which were less important over time as very high levels of accuracy has already been reached and the concern had shifted to checking the product's consistency.