Streaming events to BigQuery

January 17, 2018

Here are my notes from "Streaming events with Kafka to BigQuery and Logging" Meeting of Big Things Meetup which took place at Poalim Tech Offices. The work space is quite amazing and there were many people working as late as 10 PM.

Poalim Tech has is a great host for medium sized meetups. On the left, the obligatory pizza overdose

BigQuery - a serverless analytics
warehouse is the destination for the data.

Google BigQuery in brief

BigQuery is Google's Analytics serverless database solution based on colosul and providing as yet unmatched scaling capabilities. Usage cost are typically 5 USD per TB processed.

Pros are:

serverless data warehouse solution.
a powerful command line interface.
an SQL based interface with noSql performance.
Good code examples.

Cons are:

that queries can eat up many thousands of USD of compute time.
clunky web interface.

Apache Kafka in brief

Kafka is used to build streaming data pipelines

Apache Kafka is a highly performant free and open source message broker which allows asynchronous communication between consumer and producers of messages (messages in this case are web service based function calls). Transition to microservice architecture as well as speed and scaling concerns have made Kafka a key component in the modern enterprise's real time and streaming pipelines.

Streaming events

MyHeritage - stream change data from
Kafka into Google BigQuery

The first talk which was titled " From Kafka to BigQuery - A Guide for Streaming Billions of Daily events" by Ofir Sharony (a BackEnd Tech-Lead at MyHeritage). was about how my heritage use which use Kafka get their data into BigQuery. At a previous talk by MyHeritage engineers, they covered how they converted their monolith into microservices. Mr Sharony points that there are two type of data being placed into BigQuery. The first is database (the family trees created by the clients) and a second class is called "Change Data" which is an association of microservice events log together in context of web analytics of client session. This is becoming de facto the way microservices architecture are debugged.

Mr. Sharony outlines four iteration of Kafka to BigQuery Integration each with progressively simple architecture. They are as follows:

Take 1: Batching data to GCS

based on Secor and Google cloud storage.

Take 2: Streaming with BigQuery API

This iteration use he big query API. This was droppped since it required extensive error and exception handing which could be avoided using a kafka connector

Take 3: Streaming with Kafka Connect

He used an open-sourced Connector implemented by WePay but there were a number of issues, the BigQuery connector could only ingest data by its processing time which led some to data in the wrong partitions, and could not split events from a single stream to specific BigQuery tables and so they tried another solution:

Take 4: Streaming with Apache Beam

Talk video (hebrew):

Logging

The team have already mostly migrated from SAS to R and are also migrating to python. But the Data Science team at Bank Hapoalim tech have a number of unusual requirements and challenges.
First usually do not have to production level data. Any data they can uses must be sanitized of any sensitive PII (Personally Identifying Information).

Bank Hapoalim data scientists uses a HortonWorks
suite of 30 + open source software

Secondly they a suite of 30 of open source software tools from Hortonworks. But the software in use is based on a support contract from a vendor who changes the support of different software every year. This means that some software choices are unpopular forks of mainstream projects a decision made by the vendor rather than the data science team Also looking at some of the choices there is a significant effort to lock clients to these choices and away from competing stacks... Also since Hortonworks just provides support many of the tools in the stack are badly co-integrated. (Multiple SOLR instances with irregular levels of support for orchestration) Ofcourse to fully integrate the tools HortonWorks would need to have employ FOSS developers of sufficient standing in each projects so as to enable architectural changes for these integrations.

Thirdly their work is logs are stored for different time frames in some cases these duration are regulated:

Applications logs - are stored for some weeks for analysts to review data flows (un regulated)
Analytics log: are also stored for a few weeks (time, quality, descriptive statistics, confusion matrices etc).
Audit logs of decisions and why they were made are regulated and need to be stored for 7 year.

I find this vaguely amusing considering the Karkaeque nature of almost all decision made at bank being undisclosed non-transparent, and impossible to discover even who the decision makers are.

Talk video (hebrew):

References:

Looks like the material in the first talk is based on this blogpost. However the talk had been expanded and updated since the post.
A slide deck for the first talk is here.

Search This Blog

Bayesian.Ninja

downloading folders from google drive.