This blog is where I share insights from any number of adventures in data analysis.
I will cover best practices from modeling and analysis projects; share tips on using new tools; outline new projects; recount war stories from Wikipedia and other FOSS projects I contribute to; and discuss challenges in information retrieval, natural language processing tricks, game-theoretic insights, portfolio analysis, and social network analysis.
I wanted to download some course material on RL that the author shared via Google Drive, using the command line. I got a bunch of stuff using wget, but a folder in Google Drive was a challenge. I looked it up on SO, which gave me a hint but no solution. I installed gdown using pip and then used: gdown --folder --continue https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV If there are more than 50 files, you need to use --remaining-ok, and you only get the first 50. In such a case it's best to download the folder using the UI and decompress locally. Decompressing from the command line created errors related to Unicode, but using the Mac UI I decompressed without a glitch.
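The same call can be scripted. Below is a minimal sketch that only assembles the command line quoted above and hands it to the gdown executable; the helper names (build_gdown_command, download_folder) are my own, and it assumes gdown is already installed via pip and on the PATH.

```python
import subprocess

FOLDER_URL = "https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV"


def build_gdown_command(folder_url, resume=True, remaining_ok=False):
    """Build the gdown CLI call from the post as an argument list (hypothetical helper)."""
    cmd = ["gdown", "--folder"]
    if resume:
        cmd.append("--continue")      # resume instead of starting over
    if remaining_ok:
        cmd.append("--remaining-ok")  # accept getting only the first 50 files
    cmd.append(folder_url)
    return cmd


def download_folder(folder_url, **options):
    """Run gdown; needs `pip install gdown` and network access."""
    return subprocess.run(build_gdown_command(folder_url, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_gdown_command(FOLDER_URL, remaining_ok=True)))
```

Calling download_folder(FOLDER_URL, remaining_ok=True) would then run the download itself.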
PyData 13
1st speaker JP Morgan
Continuous Delivery in Python on a Massive Scale, by Or Been-Zeev (JP Morgan)
Abstract:
J.P. Morgan has one of the largest Python codebases in the world. We will discuss the challenges of working with millions of lines of Python and how one can deal with those. We will also show you how Python makes it easy to achieve continuous delivery and "push to production" approaches regardless of scale.
My notes:
CD = CI + Push to production
20 million lines of code - use a monolithic code base...
time to market is the KPI
but how to avoid breaking the code many times a day?
Python simplifies the typical CI pipeline as there is no compile or build
They have a single head, but it was not clear how they merge changes; they have shared staging layers to handle this issue.
Speaker separation in the wild, and the industry's view - Raphael Cohen (Chorus.ai)
Abstract:
Audio recordings are a data source of great value used for analyzing conversations and enabling digital assistants. An important aspect of analyzing single-channel audio conversations is identifying who said what, a task known as speaker diarization. The task is further complicated when the number of speakers is a priori unknown. In this talk we’ll dive into the deep learning research of speaker "embedding" (verification and diarization), connect it to a framework of “real” speaker separation needs (as in audio applications such as Chorus.ai’s Conversation Analytics platform), and present the pipeline required for integrating these solutions in a semi-supervised manner without requiring any effort by the end user.
My Notes
conversation are 10 to minutes
task 1: identify consecutive ranges spoken by a single speaker.
task 2: given a labeled sample, label each range.
Sounds like a simplification of the cocktail party problem
you might remember this from Andrew Ng's course, lecture 1
Extra Tasks:
find and share features
produces call summary
generates todo list (actionable analytics)
voice metrics - sentiment etc. (e.g. Watson)
Provide guidance
DWH
Store sales conversation as a database - for future query
Proprietary tech:
Speech recognition
who said what?
Prior Art
EigenVoices in scifi (predates Shazam by 5-6 years !?)
iVector - simple concept but complex paper & many implementation details.
replaced by a deep learning + Softmax classification architecture
Large Softmax issue - handled based on LeCun's idea of a "Siamese network"
instead of detecting who is talking
check if it is the same or a different speaker; then
we need the big Softmax just once per speaker's utterance.
Since different people sound different, a Siamese network quickly learns a fit and later does not generalise very well. (This is actually an issue of imbalance in the dataset, as the segments used are short and switches between speakers are rare...)
They used triplets ($speaker_1, speaker_1, speaker_2$) etc. to teach the network about speaker boundaries.
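To make the triplet idea concrete, here is a minimal sketch of a common hinge-style triplet loss on toy 2-D embeddings. The margin, the Euclidean distance and the numbers are my assumptions for illustration, not what the speaker's system uses.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings (made-up numbers): two segments of speaker 1, one of speaker 2.
speaker_1_a = [0.9, 0.1]
speaker_1_b = [0.8, 0.2]
speaker_2 = [0.1, 0.9]

# Correct triplet (speaker_1, speaker_1, speaker_2): nothing left to learn.
well_separated = triplet_loss(speaker_1_a, speaker_1_b, speaker_2)
# Swapped positive and negative: a large loss pushes the embeddings apart.
confused = triplet_loss(speaker_1_a, speaker_2, speaker_1_b)
```

Training on such triplets only asks "same or different speaker?", which is why the big Softmax over all speakers is no longer needed at every step.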
Some slides:
initial segmentation - segment using a 2 second window
The overall architecture
i-vector is based on Dehak et al. 2011 - complicated
DNN to the rescue!
Big Softmax !?! - so they use a Siamese architecture
Which won't generalise too well (highly unbalanced DS); sampling triplets with 2 speakers to the rescue (Li et al. 2017, Baidu)
Can we do better? (Google paper - which?)
The overall architecture
smarter distance metric via PLDA!
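The initial segmentation step from the slides (a 2 second window) can be sketched as below. I am assuming non-overlapping windows over raw samples and a hypothetical 8 kHz sample rate; the talk did not say whether the windows overlap.

```python
def split_into_windows(samples, sample_rate, window_seconds=2.0):
    """Yield (start_time, end_time, chunk) for consecutive non-overlapping windows."""
    window = int(sample_rate * window_seconds)
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        # The last chunk may be shorter than a full window.
        yield start / sample_rate, (start + len(chunk)) / sample_rate, chunk


# 5 seconds of silence at a hypothetical 8 kHz sample rate.
audio = [0.0] * (5 * 8000)
windows = list(split_into_windows(audio, sample_rate=8000))
```

Each chunk would then be embedded and compared with its neighbours to decide whether the speaker changed.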
Automated Extractions for Machine Generated Mail, by Irena Grabovitch-Zuyev (Yahoo Research)
Abstract:
A few months ago I presented Xcluster - a technique for clustering of machine generated emails and we focused on the classification use case.
Well, now that we have those classified clusters, what else can we gain from it? In this follow-up talk I will present our solution to the Mail extraction task, whose objective is to extract valuable data from the content of mail messages. This task is key for many types of applications including re-targeting, mail search, and mail summarisation, which utilises the important personal data pieces in mail messages to achieve their objectives. The heart of our solution is an offline process that leverages the structural mail-specific characteristics of the clustering, and automatically creates extraction rules that are later applied online for each new arriving message. This process has been productised in Yahoo mail backend and has been tested in large-scale experiments carried over real Yahoo mail traffic.
My Notes:
This talk covered how systems like Google Inbox, and in particular Yahoo Mail, handle grouping and smart processing of emails. Inbox does smart clustering that seems to go beyond a simple bag of words. They are also able to extract the most salient facts and present them. While parsing is the traditional approach, the paper below explains how this type of work is scaled up.
Look at the structure and hash it, AKA X-Cluster. Within each X-Cluster:
extract text by XPath - creates tables
some paths will be constants
others will be different
Use rule extraction
dictionary based (names, places, ...): values need only hit a dictionary 70% of the time to be annotated
output is a regex
Rule refinement
use classification
use xpath previously ignored...
Features (light annotations)
relative XPath position
annotation before/after
constant values before/after
HTML headers
Then we have a contextual ...
Evaluation
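A rough sketch of the table-building and dictionary-annotation steps from my notes: the messages, XPaths and city dictionary are invented for illustration, and this is my reading of the idea, not Yahoo's implementation.

```python
from collections import defaultdict


def build_table(messages):
    """Map each XPath to the list of values it takes across one cluster."""
    table = defaultdict(list)
    for message in messages:
        for xpath, text in message.items():
            table[xpath].append(text)
    return table


def split_constant_and_variable(table):
    """Constant paths are template text; variable paths are extraction candidates."""
    constant = {path for path, values in table.items() if len(set(values)) == 1}
    return constant, set(table) - constant


def annotate(values, dictionary, threshold=0.7):
    """Label a column when at least `threshold` of its values hit the dictionary."""
    hits = sum(1 for value in values if value in dictionary)
    return hits / len(values) >= threshold


# Hypothetical cluster: four machine-generated mails sharing one HTML structure.
CITIES = {"Paris", "London", "Tokyo", "Berlin"}
cluster = [
    {"/html/body/h1": "Your flight confirmation", "/html/body/table/tr[1]/td[2]": "Paris"},
    {"/html/body/h1": "Your flight confirmation", "/html/body/table/tr[1]/td[2]": "London"},
    {"/html/body/h1": "Your flight confirmation", "/html/body/table/tr[1]/td[2]": "Tokyo"},
    {"/html/body/h1": "Your flight confirmation", "/html/body/table/tr[1]/td[2]": "Gate 12"},
]

table = build_table(cluster)
constant_paths, variable_paths = split_constant_and_variable(table)
is_city_column = annotate(table["/html/body/table/tr[1]/td[2]"], CITIES)
```

Three of the four values hit the dictionary (75%), so the column is annotated despite the noisy "Gate 12" entry.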
Beyond A-B testing in the AdTech industry - Uri Goren (Bigabid)
Abstract:
A-B Testing is the default evaluation method used all across the advertising industry. However, despite the simplicity of A-B testing, it is not a silver bullet suitable for all scenarios. We would cover several flavours of A-B testing and their applications, and their upsides and downsides. We would introduce a Bayesian model to tackle some of the issues, and cover the "conjugate_prior" pypi module that is authored by the speaker.
A/B testing is a great subject for speakers to demonstrate their level of sophistication. Uri Goren did about as well as I've heard - kudos! He misses some of the big issues but avoids most of the sand pits :-) while covering the terrain.
bio - impressive
All data scientists end up working on CTR!?
mentioned avoiding confounding factors by limiting test scope.
the reason why we can assume a normal distribution; however, we soon see the distribution is highly skewed
how long before stopping? explained p-value
so the data is far from normal.
we want to estimate the CTR probability using a Bernoulli distribution
This auction pricing slide shows that CPC is based on CTR ... (aren't we missing the next bid ...)
Any prior might converge eventually ... but if there is a conjugate prior, it is the best choice. Also introducing a new Python package with Bayesian Monte Carlo simulation for A/B tests (which lets us guesstimate the remaining probability of a win for A or B)
using a conjugate prior (as it fits the posterior) the package matches posteriors with priors :-)
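The conjugate-prior idea can be sketched with the standard library alone. This is not the API of the speaker's conjugate_prior package (which I have not checked); it is a generic Beta-Bernoulli update with a uniform Beta(1, 1) prior and made-up click counts.

```python
import random


def posterior(clicks, impressions, prior_alpha=1.0, prior_beta=1.0):
    """Beta prior + Bernoulli clicks -> Beta posterior (the conjugate update)."""
    return prior_alpha + clicks, prior_beta + (impressions - clicks)


def prob_b_beats_a(post_a, post_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(CTR_B > CTR_A) under the two posteriors."""
    rng = random.Random(seed)
    wins = sum(rng.betavariate(*post_b) > rng.betavariate(*post_a) for _ in range(draws))
    return wins / draws


# Made-up numbers: A got 20 clicks and B got 40 clicks, each on 1000 impressions.
post_a = posterior(clicks=20, impressions=1000)
post_b = posterior(clicks=40, impressions=1000)
p_win = prob_b_beats_a(post_a, post_b)
```

Because the prior is conjugate, the update is just adding counts, and the "probability of a win" comes from sampling both posteriors and counting how often B comes out ahead.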
My Notes
combining features requires a factorial design
p-value - is the chance of getting the same result in an A/A test
like others before, talks about N; shows 30 is good for a uniform distribution
when do we stop - in ad-tech cross validation ?
Stratified cross validation - did not talk about it
Bernoulli is better but ...
Bayesian multi-armed bandits save you money that would be lost on the worst branch of the test while running the test.
Asked how do they know the test has run its course and/or validate the results....
some answers:
recommended tutorial on Bayesian methods for data science - Allen Downey's Think Bayes!
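The bandit note above (saving money on the worse branch while the test is still running) can be sketched with Thompson sampling over Beta posteriors. The two true CTRs are invented for the simulation; this is a toy illustration, not something shown in the talk.

```python
import random


def thompson_sampling(true_ctrs, rounds=5000, seed=0):
    """Simulate a Bernoulli bandit; return how often each arm was shown."""
    rng = random.Random(seed)
    successes = [0] * len(true_ctrs)
    failures = [0] * len(true_ctrs)
    for _ in range(rounds):
        # Draw one plausible CTR per arm from its Beta posterior, show the best draw.
        samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        if rng.random() < true_ctrs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical branches: A converts at 2%, B at 10%.
pulls = thompson_sampling([0.02, 0.10])
```

Unlike a fixed 50/50 split, traffic drifts toward the better branch as evidence accumulates, so fewer impressions are spent on the loser.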
SUL is Wikipedia's Single User Login system. The goal is to use it to authenticate Moodle users. There is one caveat: what happens if the Moodle user has no account? Moodle and MediaWiki account creation require different information. Since we'd like to use MediaWiki's standard, which is highly permissive, it is necessary to change Moodle's requirements. Challenges: Moodle does not explicitly define an object for setting registration requirements. This needs to be accomplished by admins editing a number of files manually. Errors will place the users in a limbo/blocked state... This is implemented differently on different versions of Moodle. This will break if the changes are overwritten by a system upgrade. Directions: Ideally Moodle should have a registration policy object which allows the admin to define which fields are required and if they need to be unique. For example Moodle complains if different users share an email adr
AWS CloudFormation Pros and Cons So I'm building a PaaS product that does ML-based optimisations, and that means doing work in the cloud. The ML is a neat feature, but without the basic product nothing will happen, and to bootstrap this project on AWS I tried to make use of CloudFormation, a service that automates creation and destruction of service stacks. Based on a week's worth of experimenting with CloudFormation I will try to answer the question: "Is learning CloudFormation worth the effort?" Despite the rant, CloudFormation supports creation, updating and deletion of entire stacks of services. SAM is built on top of CloudFormation and it has a visual editor. The way CloudFormation is described is that you can copy-paste snippets to create resources and build a library of reusable components. This is a simplistic point of view. In reality you need to bring properties, specify dependencies, and introduce signalling mechanisms to ensure your template works. T