Monday, April 30, 2018

PyData 13

1st speaker JP Morgan

Continuous Delivery in Python on a Massive Scale, by Or Been-Zeev (JP Morgan)


J.P. Morgan has one of the largest Python codebases in the world. We will discuss the challenges of working with millions of lines of Python and how one can deal with them. We will also show you how Python makes it easy to achieve continuous delivery and "push to production" approaches regardless of scale.

My notes:

  •  CD = CI + Push to production
  • 20 million lines of code - they use a monolithic code base...
  • time to market is the KPI 
  • but how to avoid breaking the code many times a day?
  • Python simplifies the typical CI pipeline as there is no compile or build
  • They have a single head, but it was not clear how they merge changes - they have shared staging layers to handle this issue.

Speaker separation in the wild, and the industry's view - Raphael Cohen


Audio recordings are a data source of great value used for analyzing conversations and enabling digital assistants. An important aspect of analyzing single-channel audio conversations is identifying who said what, a task known as speaker diarization. The task is further complicated when the number of speakers is a priori unknown. In this talk we'll dive into the deep learning research of speaker "embedding" (verification and diarization), connect it to a framework of "real" speaker separation needs (as in a Conversation Analytics platform), and present the pipeline required for integrating these solutions in a semi-supervised manner without requiring any effort by the end user.

My Notes

  • conversations are tens of minutes long
  • task 1: identify consecutive speaking ranges by the same speaker.
  • task 2: given a labeled sample, label each speaking range.
  • Sounds like a simplification of the cocktail party problem - you might remember this from Andrew Ng's course, lecture 1
  • Extra Tasks:
    • find and share features
    • produce a call summary
    • generate a todo list (actionable analytics)
    • voice metrics - sentiment etc. (e.g. Watson)
    • Provide guidance
  • DWH
    • Store sales conversations as a database - for future queries
  • Proprietary tech:
    • Speech recognition
      • who said what?
  • Prior Art
    • EigenVoices in scifi (predates Shazam by 5-6 years !?)
    • iVector - a simple concept but a complex paper & many implementation details.
    • replaced by a deep learning + softmax classification architecture instead

  • Large softmax issue - handled based on LeCun's idea of a "Siamese network"
    • instead of detecting who is talking,
    • check whether two utterances come from the same or different speakers; then
    • we need the big softmax just once per speaker's utterance.
  • Since different people sound different, a Siamese network quickly learns a fit but later does not generalise very well. (This is actually an issue of imbalance in the dataset, as the segments used are short and switches between speakers are rare...)
  • They used triplets (speaker1, speaker1, speaker2) etc. to teach the network about speaker boundaries.
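A toy numpy sketch of the triplet idea (illustrative only, not the speaker's actual model or loss): embeddings of the same speaker are pulled together while a different speaker must sit at least a margin further away:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge on squared distances: same-speaker pairs (anchor, positive)
    # should be closer than the different-speaker pair (anchor, negative)
    # by at least `margin`, otherwise the loss is positive.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))
```

With well-separated toy embeddings the loss is zero; swapping the positive and negative makes it positive, which is the gradient signal that teaches the network about speaker boundaries.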
Some slides:

initial segmentation - segment using a 2 second window
The overall architecture

i-vector is based on Dehak et al. 2011 - complicated

DNN to the rescue!
Big SoftMax !?! - so they use Siamese architecture 
Which won't generalise too well (highly unbalanced dataset)
sampling triplets with 2 speakers to the rescue (Li et al. 2017, Baidu)
Can we do better (Google paper -? which)
The overall architecture
smarter distance metric via PLDA!
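The 2-second initial segmentation mentioned in the slides is simple to sketch (a hypothetical helper, splitting an audio buffer into fixed windows by sample count):

```python
def two_second_windows(n_samples, sr=16000, win_s=2.0):
    # Fixed-size initial diarization segmentation: one window per
    # `win_s` seconds of audio; the last window may be shorter.
    step = int(sr * win_s)
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]
```

Each window would then be embedded and compared (e.g. with the PLDA distance from the last slide) to decide whether adjacent windows share a speaker.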

Automated Extractions for Machine Generated Mail, by Irena Grabovitch-Zuyev (Yahoo Research)


A few months ago I presented Xcluster - a technique for clustering machine-generated emails - where we focused on the classification use case.
Well, now that we have those classified clusters, what else can we gain from it?
In this follow-up talk I will present our solution to the Mail extraction task, whose objective is to extract valuable data from the content of mail messages.
This task is key for many types of applications, including re-targeting, mail search, and mail summarisation, which utilise the important personal data in mail messages to achieve their objectives. The heart of our solution is an offline process that leverages the structural, mail-specific characteristics of the clustering, and automatically creates extraction rules that are later applied online to each newly arriving message. This process has been productised in the Yahoo Mail backend and has been tested in large-scale experiments carried out over real Yahoo Mail traffic.

My Notes:

  • This talk covers how systems like Google Inbox, and in particular Yahoo Mail, handle grouping and smart processing of emails. Inbox does smart clustering that seems to go beyond a simple bag of words; they are also able to extract the most salient facts and present them. While parsing is the traditional approach, the paper below explains how this type of work is scaled up.
  • Talk covers the paper: 
  • Look at the structure and hash it AKA X-Cluster. Within each X-Cluster
    • extract text by x-path - creates tables
    • some paths will be constants
    • others will be different
  • Use rule extraction
    • dictionary based (names, places, ...) - a field needs only about a 70% hit rate against a dictionary to be annotated
    • output is a regex
  • Rule refinement
    • use classification
    • use xpath previously ignored...
    • Features (light annotations)
      • relative xpath position
      • annotation before/after
      • constant values before/after
      • HTML headers
    • Then we have a contextual ...
  • Evaluation
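A minimal sketch of the X-Cluster idea as described (hypothetical data and function names, not Yahoo's implementation): hash the set of x-paths to form a cluster key, then split the cluster's x-paths into constant paths (template text) and varying paths (extraction candidates):

```python
import hashlib

def structural_hash(xpaths):
    # Cluster key: messages generated from the same template share
    # the same set of x-paths, regardless of the text inside them.
    return hashlib.md5("|".join(sorted(set(xpaths))).encode()).hexdigest()

def split_constant_paths(messages):
    # messages: list of {xpath: text} dicts from one cluster.
    # Paths whose text never varies are template boilerplate;
    # varying paths are where extraction rules (regexes) are learned.
    paths = set().union(*messages)
    constant, varying = {}, set()
    for path in paths:
        values = {m.get(path) for m in messages}
        if len(values) == 1:
            constant[path] = values.pop()
        else:
            varying.add(path)
    return constant, varying
```

Two order-confirmation mails with the same structure land in the same cluster, and only the path holding the order number is flagged for rule extraction.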

Beyond A-B testing in the AdTech industry - Uri Goren (Bigabid)


A-B Testing is the default evaluation method used all across the advertising industry.
However, despite the simplicity of A-B testing, it is not a silver bullet suitable for all scenarios.
We would cover several flavours of A-B testing and their applications, and their upsides and downsides. We would introduce a Bayesian model to tackle some of the issues, and cover the "conjugate_prior" pypi module that is authored by the speaker.

A/B testing is a great subject for speakers to demonstrate their level of sophistication. Uri Goren did about as well as I've heard - kudos! He missed some of the big issues but avoided most of the sand pits :-) while covering the terrain.

bio - impressive
All data scientists end up working on CTR!?
mentioned avoiding confounding factors by limiting test scope.

the reason why we can assume a normal distribution
however we soon see the distribution is highly skewed

how long before stopping? explained the p-value
so the data is far from normal.
we want to estimate the CTR probability using a Bernoulli distribution

This auction pricing slide shows that CPC
is based on CTR ... (aren't we missing the next bid?) ...

Any prior might converge eventually ... but if there is a conjugate
prior, it is the best choice. He also introduced a new Python package with
Bayesian Monte Carlo simulation for A/B tests (which lets us estimate
the remaining probability of a win for A or B),
using a conjugate prior (as it fits the posterior):
the package matches posteriors with priors :-)
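The Beta-Bernoulli conjugacy he described can be sketched directly with numpy (a plain sketch, not the API of the conjugate_prior package): with a Beta(1, 1) prior, the posterior over each variant's CTR is again a Beta, and sampling both posteriors estimates the probability that B beats A:

```python
import numpy as np

def prob_b_beats_a(clicks_a, views_a, clicks_b, views_b,
                   draws=100_000, seed=0):
    # Beta(1, 1) prior; the Beta posterior is conjugate to the
    # Bernoulli click/no-click likelihood, so no MCMC is needed:
    # just draw from Beta(1 + clicks, 1 + non-clicks) for each arm.
    rng = np.random.default_rng(seed)
    post_a = rng.beta(1 + clicks_a, 1 + views_a - clicks_a, draws)
    post_b = rng.beta(1 + clicks_b, 1 + views_b - clicks_b, draws)
    return float((post_b > post_a).mean())
```

Unlike a p-value, this number is directly the "remaining probability of a win" for B, which is what the business question usually asks.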

My Notes

  • combining features requires a factorial design
  • p-value - the chance of getting the same result in an A/A test
  • like others before him, talks about N - shows 30 is good for a uniform distribution
  • when do we stop - in ad-tech, cross validation?
  • Stratified cross validation - did not talk about it
  • Bernoulli is better but ...
  • Bayesian one-armed bandits save you money that would otherwise be lost on the worse branch while the test is running.
  • Asked how they know the test has run its course and/or how they validate the results....
some answers:
 recommended tutorial on Bayesian methods for data science - Allen Downey's Think Bayes!

Monday, February 19, 2018

Insight into progressive web apps

Some notes from a Meetup on PWAs in January 2016. I feel quite knowledgeable on PWAs, but I wanted to learn more about implementing service workers. I ended up adding some research and collecting some great resources.

However, I ended up getting more detailed material on service workers based on Google's developer docs. The resources have also been expanded.

Service worker

A service worker is just a simple JavaScript file that sits between your page and the network
– It runs in another thread
– It has no access to the DOM
– It intercepts every network request (including cross-domain)

Entry point: self.caches (in service worker) or window.caches (on page)


Registering a Service Worker

• Works with promises
• Re-registration works fine

In main.js:

navigator.serviceWorker.register('/sw.js')
  .then(reg => console.log('Service worker registered with scope:', reg.scope));

Setting an interception scope

The default scope is where the sw file is, but you can control that

navigator.serviceWorker.register('/sw.js',{scope: '/my-app/'});

It will then control /my-app/ and its subdirectories

On install

  • Add initial resources (the application shell) to the cache
  • Cache has a name
  • Array of resources to cache
  • Mechanism to get a resource by path (a map)

console.log('Service Worker Registered!');
// This function builds an array of urls,
// fetches them, and stores the responses in the cache,
// example: key: 'main.js' value: 'alert(3)'

var cacheName = 'app-shell-cache-v1';
var filesToCache = ['/', '/index.html', ...];
self.addEventListener('install', event => {
  event.waitUntil(caches.open(cacheName).then(cache => {
        return cache.addAll(filesToCache);   // load app shell into the cache
    }).then(() => {
      return self.skipWaiting();
    }));
});

The install happens in the background, in case a previous version of the service worker is running. If the install fails, the old service worker is left intact.

On activate

Update the cache - remove outdated resources. Caches should be versioned. If the sum of all the caches for an origin grows too big, they may be reclaimed, so we should make sure to remove old data. This is done more easily if we use versioned caches.

self.addEventListener('activate', e => {
   e.waitUntil(caches.keys().then(keyList => {
     return Promise.all(keyList.map(key => {
       if (key !== cacheName) return caches.delete(key);
     }));
   }).then(() => self.clients.claim()));
});

On fetch

Retrieve from the cache with network fallback.
Allows us to intercept page loads.
Can get the page from the cache or from the network.
Handles offline and 404 via an exception.

self.addEventListener('fetch', event => {
  event.respondWith(caches.match(event.request)
    .then(response => {
      return response || fetch(event.request); // return cached else fetch
    }));
});

How this is handled in practice depends on the resources and their rate of change. The shell might be fetched cache-first, while news might be fetched from the network, falling back to the cache when offline.

Serving files from the cache

Cache falling back to network

As above

self.addEventListener('fetch', function(event) {
   event.respondWith(caches.match(event.request)
    .then(function(response) {
      return response || fetch(event.request);
    }));
});

Network falling back to cache

Frequently updated data with fallback to cache - say for news where we have an older feed.

self.addEventListener('fetch', function(event) {
   event.respondWith(fetch(event.request).catch(function() {
      return caches.match(event.request);
    }));
});

Cache then network

For resources that update frequently and are not versioned in the shell
E.g. (articles, avatars, social media timelines, game leader boards)
Requires 2 requests - one to cache and one to the network.

Note this code goes in the main script not the SW as it is … reactive

var networkDataReceived = false;
var networkUpdate = fetch('/data.json')
.then(function(response) {
 return response.json();
}).then(function(data) {
 networkDataReceived = true;
 // render the fresh data to the page here
});

Next we look for the resource in the cache. This will usually respond faster than the network request. We use the cached data to provide a quick response. If the network provides newer data we update again. If cache fails we try to get from the net

caches.match('/data.json').then(function(response) {
 return response.json();
}).then(function(data) {
 if (!networkDataReceived) {
   // render the cached data, unless the network already did
 }
}).catch(function() {
 return networkUpdate;
});

Generic fallback

Here is a version with a generic fallback to an offline mode if network fails

self.addEventListener('fetch', function(event) {
   event.respondWith(caches.match(event.request).then(function(response) {
     return response || fetch(event.request);
   }).catch(function() {
     return caches.match('/offline.html');
   }));
});

Progressive web apps use a manifest to set up an icon on mobile.

In html:

<link rel="manifest" href="/manifest.json">

Sample WebApp Manifest:

 "name": “Tovli",
 "short_name": “TovliWeb",
 "start_url": ".",
 "display": "standalone",
 "background_color": "#fff",
 "description": “Feel better today",
 "icons": [{
"src": "images/homescreen48.png",
"sizes": "48x48",
"type": "image/png"

Cache storage limits

Chrome and Opera: no limit. Storage is per origin, not per API.
Firefox: no limit. Prompts after 50 MB.
Mobile Safari: 50 MB limit.
Desktop Safari: no limit. Prompts after 5 MB.
Internet Explorer (10+): 250 MB limit. Prompts after 10 MB.

The PWA Checklist

• Site is served over HTTPS (localhost permitted)
• Pages are responsive on tablets & mobile devices
• Site works cross-browser
• Each page has a URL
• Page transitions don't feel like they block on the network
• The start URL (at least) loads while offline
• Metadata provided for Add to Home screen
• First load fast even on 3G
• See the full checklist

PWA with Vue

vue pwa app

npm install -g vue-cli
vue init pwa my-project
cd my-project
npm install
npm run dev

Once done, use npm to run the standard tasks:

Run the app in development mode:
npm run dev

Build for production (uglify, minify etc):
npm run build

Run unit tests with Karma + Mocha + karma-webpack:
npm run unit

Run end-to-end tests with Nightwatch:
npm run e2e


Wednesday, January 17, 2018

Streaming events to BigQuery

Here are my notes from the "Streaming events with Kafka to BigQuery and Logging" meeting of the Big Things Meetup, which took place at the Poalim Tech offices. The workspace is quite amazing, and there were many people working as late as 10 PM.

Poalim Tech is a great host for medium-sized meetups. On the left, the obligatory pizza overdose.

BigQuery - a serverless analytics warehouse - is the destination for the data.

Google BigQuery in brief

BigQuery is Google's serverless analytics database solution, based on Colossus and providing as yet unmatched scaling capabilities. Usage costs are typically 5 USD per TB processed.

Pros are: 

  • serverless data warehouse solution.
  • a powerful command line interface.
  • an SQL-based interface with NoSQL performance.
  • Good code examples.

Cons are: 

  • queries can eat up many thousands of USD of compute time.
  • a clunky web interface.

Apache Kafka in brief

Kafka is used to build streaming data pipelines.
Apache Kafka is a highly performant, free and open source message broker which allows asynchronous communication between consumers and producers of messages (messages in this case being web-service-based function calls). The transition to microservice architectures, as well as speed and scaling concerns, have made Kafka a key component in the modern enterprise's real-time and streaming pipelines.

Streaming events

MyHeritage - streaming change data from Kafka into Google BigQuery
The first talk, titled "From Kafka to BigQuery - A Guide for Streaming Billions of Daily Events" by Ofir Sharony (a backend tech lead at MyHeritage), was about how MyHeritage uses Kafka to get its data into BigQuery. At a previous talk, MyHeritage engineers covered how they converted their monolith into microservices. Mr. Sharony points out that two types of data are placed into BigQuery: the first is database data (the family trees created by the clients); the second, called "change data", associates microservice event logs with the context of a client session's web analytics. This is becoming the de facto way microservice architectures are debugged.

Mr. Sharony outlined four iterations of Kafka-to-BigQuery integration, each with a progressively simpler architecture. They are as follows:

Take 1: Batching data to GCS

Based on Secor and Google Cloud Storage.

Take 2: Streaming with BigQuery API

This iteration used the BigQuery API. It was dropped since it required extensive error and exception handling, which could be avoided by using a Kafka connector.

Take 3: Streaming with Kafka Connect

He used an open-source connector implemented by WePay, but there were a number of issues: the BigQuery connector could only ingest data by its processing time, which led to some data landing in the wrong partitions, and it could not split events from a single stream into specific BigQuery tables. So they tried another solution:
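The partition problem can be illustrated with a small hypothetical routing helper (not the WePay or MyHeritage code): derive the BigQuery partition decorator from the event's own timestamp, rather than from the time the pipeline happens to process it, so late-arriving events still land in the partition of the day they occurred:

```python
from datetime import datetime, timezone

def route_event(event, base_table="events"):
    # Partition by *event* time, not processing time: a BigQuery
    # partition decorator ("table$YYYYMMDD") addresses a specific
    # daily partition. The "ts" field is a unix epoch timestamp.
    day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%Y%m%d")
    row = {k: v for k, v in event.items() if k != "ts"}
    return f"{base_table}${day}", row
```

A connector that only supports processing-time ingestion cannot do this routing, which is the limitation that pushed the team toward Apache Beam.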

Take 4: Streaming with Apache Beam

Talk video (hebrew):


The team has already mostly migrated from SAS to R, and is also migrating to Python. But the data science team at Bank Hapoalim has a number of unusual requirements and challenges.
First, they usually do not have access to production-level data. Any data they use must be sanitized of any sensitive PII (Personally Identifying Information).

Bank Hapoalim data scientists use a HortonWorks suite of 30+ open source software tools
Secondly, they use a suite of some 30 open source software tools from Hortonworks. But the software in use is based on a support contract with a vendor who changes which software it supports every year. This means that some software choices are unpopular forks of mainstream projects, a decision made by the vendor rather than the data science team. Looking at some of the choices, there is also a significant effort to lock clients into these choices and away from competing stacks. And since Hortonworks just provides support, many of the tools in the stack are poorly co-integrated (multiple SOLR instances with irregular levels of support for orchestration). Of course, to fully integrate the tools, HortonWorks would need to employ FOSS developers of sufficient standing in each project to enable the architectural changes these integrations require.

Thirdly, their logs are stored for different time frames, and in some cases these durations are regulated:
  • Application logs are stored for some weeks for analysts to review data flows (unregulated).
  • Analytics logs are also stored for a few weeks (time, quality, descriptive statistics, confusion matrices etc.).
  • Audit logs of decisions, and why they were made, are regulated and need to be stored for 7 years.
I find this vaguely amusing considering the Kafkaesque nature of almost all decisions made at a bank: undisclosed, non-transparent, and with no way to discover even who the decision makers are.

Talk video (hebrew):


  • Looks like the material in the first talk is based on this blog post. However, the talk has been expanded and updated since the post.
  • A slide deck for the first talk is here.

Monday, January 15, 2018

How to search your YouTube history from the command line

So I eventually found that the personal Youtube search url is at:

A nice UI no doubt, but I need to get at it from the command line...

To avoid breaking the URI we must ensure the query text is url encoded so:

urlencode() {
    # urlencode <string>
    local length="${#1}"
    for (( i = 0; i < length; i++ )); do
        local c="${1:i:1}"
        case $c in
            [a-zA-Z0-9.~_-]) printf '%s' "$c" ;;
            *) printf '%%%02X' "'$c" ;;
        esac
    done
}

with that:

#youtube history command - prefix the personal history-search URL from above
yth() {
    google-chrome "$(urlencode "$*")&restrict=ytw"
}

p.s. all these go in a dotfile, say at ~/Dotfiles/.functions, and are then sourced via:

$source ~/Dotfiles/.functions

so to look up the legendary session "Willy Wonka of Containers - Jessie Frazelle"

I need only type:

$yth Willy Wonka

And I instantly achieve container nirvana at #ContainerCamp.


Saturday, December 30, 2017

Automating web app development with Polymer and Yeoman

Yeoman lets you configure and stamp out sophisticated boilerplate projects from the command line.
In a digital marketing agency, the data science team may be asked to provide, for each client's campaign, fairly similar media reports and explanatory analytics dashboards for both external and internal clients. For longer-term projects we may be asked to also provide predictive analytics. The data comes from advertisers (Google, Facebook, Taboola, OutBrain, phone tracking metrics APIs); outcomes are channeled via Segment into, say, Google Analytics, which has both an API and Polymer components. Usually there will be additional data science products - predictions, special segment data, funnels, market research, attribution charts - and long-term data in BigQuery, which has an API as well. Some vendors don't have an API, so to access their data it is exported to CSV and placed into Google Sheets, which has an API and a Polymer component. It takes too much work and time to code all these dashboards unless the campaign is long-term. But with a generator like Yeoman, incrementally adding each part, much of the creation of these dashboards can be automated, reducing the time and complexity of building them.

A second use case for this combo is creating HTML5 banners. Campaigns often require creating many similar banners, especially if you want to scientifically optimise your creative using a fractional factorial experimental design. Here is the Buildabanner Yeoman generator.


Polymer together with Yeoman can help kick-start a new web project with an opinionated, fully baked tooling infrastructure. Each new edition of Polymer brings many changes and different tooling, and Yeoman is not very well documented and is challenging to integrate into the increasingly automated build formats most CLIs use today. Both have fairly steep learning curves, which may make it difficult to justify the return on time invested in tooling. But if you have many similar projects planned, or are building a self-serve system, Yeoman and Polymer may be just the right fit.

Yeoman is an automation tool for creation of a web project. The more structured your projects are the more Yeoman is going to save you time. It also shares with web components the notion of compatibility which can help support complexity. However to make a Yeoman generator bullet-proof may require long term support and fixing bugs which occur on other people's systems.

The two have been combined in the Polymer CLI, though you may currently be interested in the following resources if this is a project you wish to automate. A generator also lets a team concentrate its knowledge in one place, which more readily supports additional automation via scripts and build tools, a split production/development pipeline, and CI down the line.

Another issue is common to working with large boilerplate projects: few people know what all the boilerplate is doing, or how it can be tested or changed. So consider documenting the project thoroughly.

Resources on accelerating Polymer projects with Yeoman

I cover these because Yeoman is easy to get started with, but you soon end up interacting with Bower, NPM and other tools, which boils down to a steep learning curve.

Polymer is the JavaScript library that teaches the browser new tricks. Polymer has had strong ties to the Yeoman project - perhaps since the polymath Addy Osmani is on both development teams. Yeoman is described as "The web's scaffolding tool for modern web apps".

Here is an article from 2013 on creating Polymer projects using Yeoman; a year later he released the following video:

+Addy Osmani -  Building a Polymer app with Yeoman 2014

YOLOmer! Polymer and Yeoman for lightning fast dev

When Polymer 1.0 was released it was introduced with a couple of starter kit projects, one for new users and another for power users. This project included some fairly sophisticated use of tooling to provide a plethora of configurable features, such as offline support using service workers. At the Polymer Summit in 2015, Rob Dobson introduced a Polymer Yeoman generator that stamps out a PSK project.

+Rob Dobson - End to End with Polymer from The Polymer Summit 2015

In 2016 Polymer introduced routing and layout components and behaviors. These were released with the Polymer CLI, a command line tool based on Yeoman that provides unified installation of, and access to, many tools used with Polymer. However, the PSK2 project was not updated to work with these, and eventually a much simpler starter project was recommended.

These were introduced by Rob Dobson in 2016 in a couple of Polycast episodes, 52 and 53, focused on the PSK2.

Rob Dobson - How to build a CLI generator -- Polycasts #53

Polymer CLI Generators 101 

Polymer CLI allows us to generate components. There are many types of components, and creating a working environment with a demo and tests takes lots of work and research. We are also at the cusp between Polymer 1 and Polymer 2, where ES2015 reigns supreme.

So let's build some generators for custom components in Polymer. This is one use case where working with generators can have a significant payoff in time saved:

  • Polymer 2 preview elements
  • an es6 Polymer element
  • psk2 with es6 support
  • an element in psk2 with es6 support
  • a style element
  • a behaviour
  • d3 element etc (integration with libraries that can play with polymer) What are the requirements for a lib so it can be integrated with polymer's data binding model.
  • a PWA project with Firebase support

Additional References

Wednesday, December 27, 2017

Serverless Big Data

I was at the first meeting of the Big Data Analytics meetup. The first speaker was Avi Zloof, CEO of EvaluateX, who gave a talk titled "Serverless Big Data - The Good, and the Great".

The Speaker Avi Zloof
EvaluateX, located at "The Junction" (Rothschild 9, Tel Aviv), is an outfit with a Chrome plugin that can optimize Google BigQuery SQL queries in the web interface. My last BigQuery project, however, had abandoned the web interface and switched to 100% automation via the API. And despite having massive queries there was little need to optimize them; I had been more concerned with comparing different editions of the projects to detect data discrepancies. The connection between Big Data and the GUI is often the primary challenge, but this was not the subject of the talk.
The talk introduced me to EvaluateX and their activity. Mr. Zloof shared many interesting professional insights, as well as his point of view on serverless database platforms. He briefly outlined the history of serverless databases, mentioning

  • Google Big Query
  • AWS Athena
  • Azure Functions
  • Google Cloud Functions
  • IBM OpenWhisk

Mr Zloof's primary takeaway message was that the pricing model is the key to correctly evaluating a platform's suitability for a company's business model.

The Serverless Databases Platforms for 2018
Next came a back-of-the-napkin calculation which posited that if processing a TB query costs 5 USD on BigQuery and creates 6 USD of value, you have a viable business model for working with big data. Ignoring storage and networking costs might be a flaw in this rough model. However, I cannot deny that reducing the complexity of pricing cloud services is an easier sell to middle and upper management than the labyrinthine calculations needed to price real-world cloud services for BI systems, and that this approach distills the cost of processing to its essence.
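That back-of-the-napkin model fits in one line (a hypothetical helper, using the 5 USD/TB on-demand price quoted in the talk and deliberately ignoring storage and networking):

```python
def query_value_margin(value_usd, tb_scanned, price_per_tb=5.0):
    # BigQuery on-demand pricing charges per TB scanned; a positive
    # margin means the query is worth running under this rough model.
    return value_usd - tb_scanned * price_per_tb
```

Crude as it is, it captures why the pricing model, not raw performance, was the talk's yardstick for platform suitability.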

Mr Zloof noted that when Oracle CTO Larry Ellison announced his company's entry into the serverless database space with Fn in October 2017, such minutiae as the pricing plan were glossed over. Once the service's pricing is finalised, it will become the deciding factor in evaluating how competitive Oracle's new platform will be in this crowded space.

Price tags for 2018: Google BigQuery is 5 USD per TB. AWS Athena is priced at 5 USD per compressed TB, which can cost a third less than BigQuery. Some other insights: in terms of scaling and performance, although Athena is cheaper than BigQuery, it is far less powerful. DynamoDB is a key-value store, and key-value stores are not as suitable for analytics work, or for general purpose work, as an SQL backend.

Mr Zloof stated that NoSQL solutions are adding SQL interfaces, and that after many years he now sees SQL as the lingua franca of Big Data systems. I haven't seen this in, say, Firebase; however, superlatives aside, this is definitely a trend in the evolution of NoSQL systems. Google BigQuery, for example, started out with a proprietary SQL dialect and now supports a more compliant SQL format. But when a NoSQL database adds an SQL front end, it is highly unlikely to be as performant as a SQL backend, which is where query optimization becomes important.

Another important feature not available from most serverless databases is a "stop loss" ability. The term, which comes from the finance industry, means a command to halt an operation that would otherwise incur very large charges. Most queries need a single scan of the data warehouse, but if your operation is polynomial it could run for days over your data, and you might not be aware of it, or able to halt it, without intervention from support, which could take hours to respond. So a stop-loss capability can be a game changer.

All in all this was a great talk and I hope to be hearing more from this speaker in the future.