This blog is where I share insights from any number of adventures in data analysis.
I will cover best practices from modeling and analysis projects; share tips on using new tools; outline new projects; recount war stories from Wikipedia and other FOSS projects I contribute to; and discuss information retrieval challenges, natural language processing tricks, game theoretic insights, portfolio analysis, and social network analysis.
I wanted to download some course material on RL, shared by the author via Google Drive, using the command line. I got a bunch of the files using wget, but downloading a whole folder from Google Drive was a challenge. I looked it up on Stack Overflow, which gave me a hint but no solution. I installed gdown using pip and then used: gdown --folder --continue https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV. If there are more than 50 files you need to add --remaining-ok, and even then you only get the first 50. In such a case it is best to download the folder using the web UI and decompress it locally. Decompressing from the command line produced Unicode-related errors, but using the Mac UI I decompressed it without a glitch.
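For scripting, gdown also exposes a Python API; here is a minimal sketch, assuming the gdown package is installed and the folder is still shared publicly (the output directory name is arbitrary):

```python
# Sketch: download a shared Google Drive folder via gdown's Python API.
# Assumes `pip install gdown`; remaining_ok mirrors the --remaining-ok flag,
# letting the call proceed (first 50 files only) when the folder is larger.
import gdown

url = "https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV"
files = gdown.download_folder(url, output="rl_course", remaining_ok=True)
if files:
    print(f"Downloaded {len(files)} files")
```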
Streaming events to BigQuery
Here are my notes from the "Streaming events with Kafka to BigQuery and Logging" meeting of the Big Things Meetup, which took place at the Poalim Tech offices. The workspace is quite amazing, and there were many people working as late as 10 PM.
Poalim Tech is a great host for medium-sized meetups. On the left, the obligatory pizza overdose.
BigQuery, a serverless analytics warehouse, is the destination for the data.
Google BigQuery in brief
BigQuery is Google's serverless analytics database solution, based on Colossus and providing as yet unmatched scaling capabilities. Usage costs are typically 5 USD per TB processed.
Pros are:
a serverless data warehouse solution.
a powerful command line interface.
an SQL-based interface with NoSQL-scale performance.
good code examples.
Cons are:
queries can eat up many thousands of USD of compute time (see the dry-run sketch below).
a clunky web interface.
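Since billing is by bytes processed, it is worth estimating a query's cost before running it. Here is a minimal sketch using the google-cloud-bigquery client's dry-run mode; the dataset and query are hypothetical:

```python
# Sketch: estimate BigQuery query cost with a dry run (no bytes are billed).
# Assumes `pip install google-cloud-bigquery` and default GCP credentials.
from google.cloud import bigquery

client = bigquery.Client()  # uses the project from default credentials
sql = "SELECT event_type, COUNT(*) FROM `my_dataset.events` GROUP BY event_type"

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

tb = job.total_bytes_processed / 1e12
print(f"Estimated scan: {tb:.4f} TB, roughly ${tb * 5:.2f} at 5 USD per TB")
```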
Apache Kafka in brief
Kafka is used to build streaming data pipelines
Apache Kafka is a highly performant, free and open source message broker which allows asynchronous communication between producers and consumers of messages (messages in this case are web-service-based function calls). The transition to microservice architectures, as well as speed and scaling concerns, has made Kafka a key component in the modern enterprise's real-time and streaming pipelines.
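To make the producer/consumer terminology concrete, here is a minimal sketch using the kafka-python client; the broker address and topic name are placeholders:

```python
# Sketch: a producer publishing JSON events and a consumer reading them back.
# Assumes `pip install kafka-python` and a broker listening on localhost:9092.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page_views", {"user_id": 42, "page": "/home"})
producer.flush()

consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # e.g. {'user_id': 42, 'page': '/home'}
    break
```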
Streaming events
MyHeritage - streaming change data from Kafka into Google BigQuery
The first talk, "From Kafka to BigQuery - A Guide for Streaming Billions of Daily Events" by Ofir Sharony (a backend tech lead at MyHeritage), was about how MyHeritage, which uses Kafka, gets its data into BigQuery. At a previous talk, MyHeritage engineers covered how they converted their monolith into microservices. Mr. Sharony points out that there are two types of data being placed into BigQuery. The first is database data (the family trees created by the clients); the second, called "change data", is an association of microservice event logs in the context of the web analytics of a client session. This is becoming the de facto way microservice architectures are debugged.
Mr. Sharony outlined four iterations of the Kafka-to-BigQuery integration, each with a progressively simpler architecture. They are as follows:
Take 1: Batching data to GCS
This iteration was based on Secor and Google Cloud Storage.
Take 2: Streaming with BigQuery API
This iteration used the BigQuery API directly. It was dropped since it required extensive error and exception handling, which could be avoided by using a Kafka connector.
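For reference, the direct streaming-insert approach looks roughly like the sketch below with the google-cloud-bigquery client; the table and payload are hypothetical, and the error handling hinted at in the last lines is exactly the burden that motivated moving to a connector:

```python
# Sketch: streaming rows into BigQuery via the streaming (insertAll) API.
# Assumes `pip install google-cloud-bigquery` and an existing events table.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.events"  # hypothetical table

rows = [
    {"event_type": "signup", "user_id": 42, "ts": "2018-06-01T10:00:00Z"},
    {"event_type": "login", "user_id": 7, "ts": "2018-06-01T10:01:00Z"},
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    # In production, retry and dead-letter logic has to live here.
    print("Failed inserts:", errors)
```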
Take 3: Streaming with Kafka Connect
He used an open-source connector implemented by WePay, but there were a number of issues: the BigQuery connector could only ingest data by its processing time, which led to some data landing in the wrong partitions, and it could not split events from a single stream into specific BigQuery tables, and so they tried another solution.
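For illustration, deploying such a sink connector typically amounts to POSTing a JSON config to the Kafka Connect REST API. A hedged sketch follows; the connector class and config keys are assumptions based on my reading of the WePay connector's documentation, and the host, topic, and project names are placeholders:

```python
# Sketch: register a BigQuery sink connector via the Kafka Connect REST API.
# Assumes a Connect worker on localhost:8083 with the WePay connector installed;
# the connector class and config keys below are assumptions, not verified values.
import json
import requests

connector = {
    "name": "bigquery-sink",
    "config": {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "topics": "page_views",
        "project": "my-gcp-project",
        "datasets": ".*=my_dataset",
        "keyfile": "/etc/kafka-connect/gcp-credentials.json",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```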
Take 4: Streaming with Apache Beam
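For a rough idea of what this pattern looks like, here is a minimal sketch in the Beam Python SDK; the topic, table, and schema are hypothetical, and ReadFromKafka is a cross-language transform that needs a Java expansion service at runtime:

```python
# Sketch: a streaming Beam pipeline from Kafka into BigQuery.
# Assumes `pip install apache-beam[gcp]`; all names below are placeholders.
import json
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "localhost:9092"},
            topics=["page_views"],
        )
        # Kafka records arrive as (key, value) byte pairs; decode the value.
        | "Decode" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.events",
            schema="event_type:STRING,user_id:INTEGER,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```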
Talk video (Hebrew):
Logging
The Bank Hapoalim Tech data science team has already mostly migrated from SAS to R and is also migrating to Python, but it has a number of unusual requirements and challenges.
First, they usually do not have access to production-level data. Any data they can use must be sanitized of any sensitive PII (personally identifying information).
Bank Hapoalim data scientists use a Hortonworks suite of 30+ open source software tools.
Secondly, they use a suite of about 30 open source software tools from Hortonworks. The software in use is based on a support contract with a vendor who changes which software it supports every year. This means that some software choices are unpopular forks of mainstream projects, a decision made by the vendor rather than by the data science team. Looking at some of the choices, there is also a significant effort to lock clients into these choices and away from competing stacks. And since Hortonworks just provides support, many of the tools in the stack are badly co-integrated (multiple SOLR instances with irregular levels of support for orchestration). Of course, to fully integrate the tools, Hortonworks would need to employ FOSS developers of sufficient standing in each project to enable architectural changes for these integrations.
Thirdly, their logs are stored for different time frames, and in some cases these durations are regulated:
Application logs are stored for some weeks so analysts can review data flows (unregulated).
Analytics logs are also stored for a few weeks (timing, quality, descriptive statistics, confusion matrices, etc.).
Audit logs of decisions and why they were made are regulated and need to be stored for 7 years.
I find this vaguely amusing considering the Kafkaesque nature of almost all decisions made at a bank: they are undisclosed and non-transparent, and it is impossible to discover even who the decision makers are.
Talk video (Hebrew):
References:
It looks like the material in the first talk is based on this blog post; however, the talk has been expanded and updated since the post was written.
SUL is Wikipedia's single user login system. The goal is to use it to authenticate Moodle users, with one caveat: what happens if the Moodle user has no account? Moodle and MediaWiki account creation require different information. Since we'd like to use MediaWiki's standard, which is highly permissive, it is necessary to change Moodle's requirements.
Challenges
Moodle does not explicitly define an object for setting registration requirements. This has to be accomplished by admins editing a number of files manually, and errors will place users in a limbo/blocked state. This is implemented differently on different versions of Moodle, and it will break if the changes are overwritten by a system upgrade.
Directions
Ideally Moodle should have a registration policy object which allows the admin to define which fields are required and whether they need to be unique. For example, Moodle complains if different users share an email address.
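To illustrate the idea (this is purely hypothetical, not an existing Moodle API, and sketched in Python rather than Moodle's PHP), such a registration policy object could be a declarative mapping of field names to constraints that the signup form validates against:

```python
# Hypothetical sketch of a declarative registration policy; not actual Moodle code.
from dataclasses import dataclass

@dataclass
class FieldPolicy:
    required: bool = False
    unique: bool = False

# A MediaWiki-style permissive policy: only a unique username is mandatory.
REGISTRATION_POLICY = {
    "username": FieldPolicy(required=True, unique=True),
    "email": FieldPolicy(required=False, unique=False),
    "firstname": FieldPolicy(required=False),
    "lastname": FieldPolicy(required=False),
}

def validate_signup(form: dict, existing_users: list) -> list:
    """Return a list of validation errors for a signup form."""
    errors = []
    for field, policy in REGISTRATION_POLICY.items():
        value = form.get(field)
        if policy.required and not value:
            errors.append(f"{field} is required")
        if policy.unique and value and any(u.get(field) == value for u in existing_users):
            errors.append(f"{field} must be unique")
    return errors
```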
AWS CloudFormation Pros and Cons
So I'm building a PaaS product that does ML-based optimisations, and that means doing work in the cloud. The ML is a neat feature, but without the basic product nothing will happen, and to bootstrap this project on AWS I tried to make use of CloudFormation, a service that automates creation and destruction of service stacks. Based on a week's worth of experimenting with CloudFormation, I will try to answer the question: "Is learning CloudFormation worth the effort?" Despite the rant, CloudFormation supports creation, updating and deletion of entire stacks of services. SAM is built on top of CloudFormation, and it has a visual editor. The way CloudFormation is described, you can copy-paste snippets to create resources and build a library of reusable components. This is a simplistic point of view. In reality you need to bring in properties, specify dependencies, and introduce signalling mechanisms to ensure your template works.
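As a small concrete taste of what "automating creation and destruction of service stacks" means, here is a hedged sketch that creates and later deletes a stack with boto3; the stack name and the toy template are mine, and real templates quickly accumulate parameters, dependencies, and creation policies:

```python
# Sketch: create and tear down a CloudFormation stack with boto3.
# Assumes AWS credentials are configured; the template is a minimal toy example.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DataBucket:
    Type: AWS::S3::Bucket
"""

cfn = boto3.client("cloudformation")

cfn.create_stack(StackName="demo-stack", TemplateBody=TEMPLATE)
cfn.get_waiter("stack_create_complete").wait(StackName="demo-stack")
print("Stack created")

# Deleting the stack removes every resource it created (the bucket must be empty).
cfn.delete_stack(StackName="demo-stack")
```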