Sunday, January 13, 2019

Android Coding Conundrums 1: Fragment Constructors

While researching the use of the factory design pattern for fragment creation, I couldn't help but notice that fragment creation is a long-term source of bugs. Why is fragment creation error-prone?

Perhaps because the Fragment API has changed so frequently that much of the advice out there is dated. In the real world the framework Fragment is deprecated in favour of a descendant in the support library, but that has been deprecated as well in favour of the AndroidX libraries.


(Image: W3sDesign Factory Method design pattern UML)

Perhaps it is because many newcomers to Android are Java developers who follow the Java idiom of constructor overloading to pass parameters at creation time, for later use in onCreate(). However, this is not a good idea; it is an example of a bug pattern. Using an overloaded constructor will usually appear to work fine until Android destroys and recreates the fragment and looks for a default (no-argument) constructor. If there isn't one, the app crashes with a runtime exception. This is because, behind the scenes, the default constructor is called rather than the constructor you provided; the parameters you used to construct the fragment are simply not known to the Android framework.

If there is a default constructor, a more subtle bug arises: the fragment is recreated after a configuration change or after the process is killed, using the default constructor and without any of the parameters used previously. This is not going to work unless the parameters are somehow set again before they are used; the moment you try to get at them you will get a NullPointerException. Getters and setters may help, and default values may help a little more, but this is not really fixing the issue.
The default Android mechanism for saving and restoring state is of little use here: it only covers state stored in views. Anything more sophisticated requires attention from the programmer.
Even if all the state has been saved to a bundle in onSaveInstanceState() and restored later, this is likely still not enough.

If the constructor passed resource ids or similar values that are required early in the lifecycle, before your restore code has run, say in onCreate() or onCreateView(), such fragments may still crash. At that point reproducing the bug may require a rather elaborate scenario: the bug is still there, but it gets harder and harder to reproduce.
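The usual fix, and the reason the factory pattern keeps coming up, is to keep the no-argument constructor and pass parameters through the fragment's arguments Bundle via a static newInstance() factory method; the framework saves and restores that Bundle across recreation. A minimal sketch (MyFragment and the single title argument are made up for illustration, assuming the AndroidX Fragment class):

import android.os.Bundle;
import androidx.annotation.Nullable;
import androidx.fragment.app.Fragment;

public class MyFragment extends Fragment {

    private static final String ARG_TITLE = "arg_title";

    // The framework needs a public no-argument constructor so it can
    // recreate the fragment after process death or a configuration change.
    public MyFragment() { }

    // Static factory method: the supported way to pass creation parameters.
    public static MyFragment newInstance(String title) {
        MyFragment fragment = new MyFragment();
        Bundle args = new Bundle();
        args.putString(ARG_TITLE, title);
        fragment.setArguments(args); // saved and restored by the framework
        return fragment;
    }

    @Override
    public void onCreate(@Nullable Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // getArguments() returns the same Bundle even after recreation.
        String title = getArguments() != null
                ? getArguments().getString(ARG_TITLE)
                : null;
    }
}

Callers then use MyFragment.newInstance("Hello") instead of a constructor, and recreation by the framework finds both the no-argument constructor and the arguments it needs.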

I also noticed some mentions of an Android developer setting called "don't keep activities". Via adb it is a global flag which tells Android to call finish on activities as soon as they lose focus, which has much the same effect as a configuration change. Once it is set, activities are no longer kept in the task (the top-level activity container for fragments). This ensures the fragment's default constructor will be called, as if the device were a memory-starved handset from around 2012. Using this setting we should crash faster and more consistently if the above bug were introduced into the app, making it easier to find and fix.

I spent some time figuring out how to control this setting via adb.
This raises another question: how to coordinate tests and adb commands.
Or better yet, how to do the adb voodoo using JUnit rules.
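For reference, the switch behind "don't keep activities" appears to be the always_finish_activities global setting, so something like adb shell settings put global always_finish_activities 1 turns it on. Below is a rough, untested sketch of wrapping that in a JUnit rule for instrumentation tests; DontKeepActivitiesRule is a made-up name and the androidx.test APIs are assumed:

import android.app.UiAutomation;
import androidx.test.platform.app.InstrumentationRegistry;
import org.junit.rules.ExternalResource;

/** Hypothetical rule: enables "don't keep activities" for the duration of a test. */
public class DontKeepActivitiesRule extends ExternalResource {

    private UiAutomation uiAutomation() {
        return InstrumentationRegistry.getInstrumentation().getUiAutomation();
    }

    @Override
    protected void before() {
        // Same effect as Developer options -> "Don't keep activities".
        // (The returned ParcelFileDescriptor is ignored here for brevity.)
        uiAutomation().executeShellCommand("settings put global always_finish_activities 1");
    }

    @Override
    protected void after() {
        uiAutomation().executeShellCommand("settings put global always_finish_activities 0");
    }
}

A test class would then declare it with @Rule (or @ClassRule on a static field) so every activity is torn down as soon as it is backgrounded.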

I hope to cover these in more depth in a future post.

Tuesday, August 14, 2018

Big Data Analytics Israel - New Year, New Data Scientist Job: 5 Things To Think About

Data science interviews can be overwhelming.



My notes: https://www.meetup.com/Big-Data-Analytics-Israel/events/253124286/



The first talk was by:

Raya Belinsky - "New job - yes or no?"

The talk was about finding your next job or reinventing your current one. Ms Belinsky's humour and background as an executive life coach made this talk both pleasant and worthwhile.

She covered:
  • Her operational definition of job burnout
  • The LinkedIn profile - complete it (it tells you what to do)
  • The CV - ask two people to prepare it
  • The interview - e.g. prepare three questions

Each had at least a couple of points worth taking care of in your next round of job search. Check out the talk and slides when they go online.




The second talk was by:

Nathaniel Shimoni - "Life story"

Mr Shimoni is experienced and had a compelling story to tell about his own twisting path to becoming a data scientist.

Some Highlights:
  • Listened to an hour of lectures during his commute.
  • Later decided to dedicate an hour a day to it.
  • Participated in many Kaggle competitions.
  • Liked: Data Hack.
  • Reads papers to keep up to date.
  • Recommends taking risks you can afford.




Third talk:

Omri Allouche "The top mistakes you're making in your Data Science interview"

Mr Allouche also had an unorthodox track into data science. He used many metaphors from data science, which was refreshing in this talk sequence, as was his use of compelling visuals.

Ask what will you be doing:
  • Writing code that goes to production
  • Develop new algorithms
  • Be in charge of collecting data
  • Work alone / lead others
Don't run away from your super powers.
Don't be the only/first data scientist

Running away from data in data science 
  • Don't skip - Exploratory analysis
  • Unsupervised is cool... don't rush to do supervised models
  • Learning to do proper error analysis - when is the model wrong...
Running away from science in data science 
Use your intuition but learn to say - "I don't know but I would try ... " (2 different solutions)
Mr Allouche talked about the community and suggested we could have a conversation brainstorming strange new ideas.

Define:
  • Data set
  • Input 
  • Output
  • Own the loss function yourself

Overconfidence is a problem - it suggests that this person is not going to learn much.

But my five cents on this interesting lecture: it does not seem to be grounded in having conducted many interviews or sat in on them. Some of his comments were contrarian, and his pointers on CVs may be counterproductive.



Fourth talk: 

Aharon Frazer "The Skills That Make a Great Data Scientist"

Aharon was the only American Rabbi 🐰 DS speaker.
He did his studies in the US, then was a PHP coder.
He suggests asking about jobs not being offered - in smaller companies. 
Did BI at "Seeking Alpha" when he was looking for work as a web developer.
Went to JoyTunes and after 4 months was headhunted by Facebook.

He talked about:

data engineering - 
  • Data accessibility
  • Data quality
  • Logging
  • ETL Pipelines
  • Dashboards
  • Alerts
data science - data scientists are analysts
  • Identifying opportunities
  • Product vision / forecasting
  • Goal setting & tracking
  • Product updates

Before and after analysis - 

Reality is messy

Experiments @ Facebook

Exposures \
           ======> Stats Engine ======> Metrics Change
Metrics   /

Some interviewing notions:

I interviewed many more times than I got jobs

look at problems as 3-d

People treat it as if it is a textbook problem, but he is more interested in someone who imagines the problem is really happening.

Show you have the template of the problem in your head.
Model - Looking at errors 
  • Framing the problem first.  - "Here is a metric of success."
Self awareness.
Come to agreement with the interviewers.
Some questions have no great answer but cover a fundamental issue. 
Pros and cons of real world situations.

One note: Mr Frazer had a slide-deck disaster, but it only slowed him down a bit and he could talk well without his slides - kudos on that. If you are going to give a talk, practice giving it without your slides.

Wednesday, August 8, 2018

Paratroopers Puzzle

Puzzle:

Two paratroopers are dropped onto a practically infinite railway track. Both were given a note with identical instructions... They both follow the instructions and eventually meet up.
What did the note tell them to do?

Answer:

To drop their parachutes on the track. Then they should run 10 steps north, then switch and run 3 times as far to the south, then switch again and triple the distance, and not stop until they meet or reach the other parachute...

The fun answer:

The standard random walk has properties related to the normal distribution (which the binomial distribution approaches as N goes to infinity). For the random walk, the mean position of the walker is his or her starting point; the variance, however, grows linearly with time, so the typical distance from the start grows with the square root of the time. So pretty much any random walk would work as a rendezvous strategy - "whenever you run past a pub, pop in and do not leave until you are punch drunk" is probably as good a randomising strategy for the above answer.
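In symbols, for a simple symmetric walk of $n$ unit steps $X_i = \pm 1$:

$$S_n = \sum_{i=1}^{n} X_i, \qquad \mathbb{E}[S_n] = 0, \qquad \operatorname{Var}(S_n) = n, \qquad \sigma_{S_n} = \sqrt{n}$$

so the walkers wander on a scale of $\sqrt{n}$ steps, which is why a random strategy eventually works, just slowly.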


For more details you can look at the following entry on Cross Validated.
This is also called the two robots problem.


Monday, April 30, 2018

PyData 13


1st speaker JP Morgan

Continuous Delivery in Python on a Massive Scale, by Or Been-Zeev (JP Morgan)

Abstract: 

J.P. Morgan has one of the largest Python codebases in the world.  We will discuss the challenges of working with millions of lines of Python and how one can deal with those. We will also show you how Python makes it easy to achieve continuous delivery and ”push to production” approaches regardless of scale.

My notes:


  •  CD = CI + Push to production
  • 20 million lines of code - use a monolithic code base...
  • time to market is the KPI 
  • but how to avoid breaking the code many times a day?
  • Python simplifies the typical CI pipeline as there is no compile or build
  • They have a single head, but it was not clear how they merge changes - they have shared staging layers to handle this issue.

Speaker separation in the wild, and the industry's view - Raphael Cohen (Chorus.ai)

Abstract:

Audio recordings are a data source of great value used for analyzing conversations and enabling digital assistants. An important aspect of analyzing single-channel audio conversations is identifying who said what, a task known as speaker diarization. The task is further complicated when the number of speakers is a priori unknown. In this talk we’ll dive into the deep learning research of speaker "embedding" (verification and diarization). Connect it to a framework of “real” speaker separation needs (as in audio applications such as Chorus.ai’s Conversation Analytics platform), and present the pipeline required for integrating these solutions in a semi supervised manner without requiring any effort by the end user

My Notes


  • conversations are on the order of tens of minutes
  • task 1: identify consecutive speaking ranges by the same speaker.
  • task 2: given a labelled sample, label the ranges.
  • Sounds like a simplification of the cocktail party problem (you might remember this from lecture 1 of Andrew Ng's course)
  • Extra Tasks:
    • find and share features
    • produces call summary
    • generates todo list (actionable analytics)
    • voice metrics - sentiment etc. (e.g. Watson)
    • Provide guidance
  • DWH
    • Store sales conversation as a database - for future query
  • Proprietary tech:
    • Speech recognition
      • who said what?
  • Prior Art
    • EigenVoices in scifi (predates Shazam by 5-6 years !?)
    • iVector - simple concept but complex paper & many implementation details. 
    • replaced by Deep learning + Softmax classification architecture instead

  • Large Softmax issue - handled based on LeCun's idea of a "Siamese network"
    • instead of detecting who is talking 
    • check if it is the same or different speaker then 
    • we need the big SoftMax just once per speaker's utterance.
  • Since different people sound different, a Siamese network quickly learns a fit but later does not generalise very well. (This is actually an issue of imbalance in the dataset, as the segments used are short and switches between speakers are rare...)
  • They used triplets (speaker 1, speaker 1, speaker 2) and so on to teach the network about speaker boundaries; a generic formulation of the loss is sketched below.
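For reference, the generic triplet loss (the textbook formulation, not necessarily the exact variant used in the paper) takes an anchor utterance $a$, a positive $p$ from the same speaker and a negative $n$ from a different speaker, and pushes the embedding $f$ to keep same-speaker pairs at least a margin $\alpha$ closer than different-speaker pairs:

$$\mathcal{L}(a, p, n) = \max\bigl(0,\ d(f(a), f(p)) - d(f(a), f(n)) + \alpha\bigr)$$

where $d$ is a distance such as Euclidean or cosine distance.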
Some slides:

initial segmentation - segment using a 2 second window
The overall architecture

i-vector is based on Dehak et al. 2011 - complicated

DNN to the rescue!
Big SoftMax !?! - so they use Siamese architecture 
Which won't generalise too well (highly unbalanced DS)
sample triplets with 2 speakers to the rescue (Li et al. 2017, Baidu)
Can we do better (Google paper -? which)
The overall architecture
smarter distance metric via PLDA!

Automated Extractions for Machine Generated Mail, by Irena Grabovitch-Zuyev (Yahoo Research)

Abstract:

A few months ago I presented Xcluster - a technique for clustering of machine generated emails and we focused on the classification use case.
Well, now that we have those classified clusters, what else can we gain from it?
In this follow-up talk I will present our solution to the Mail extraction task, whose objective is to extract valuable data from the content of mail messages.
This task is key for many types of applications including re-targeting, mail search, and mail summarisation, which utilises the important personal data pieces in mail messages to achieve their objectives. The heart of our solution is an offline process that leverages the structural mail-specific characteristics of the clustering, and automatically creates extraction rules that are later applied online for each new arriving message. This process has been productised in Yahoo mail backend and has been tested in large-scale experiments carried over real Yahoo mail traffic.

My Notes:

  • This talk covered how systems like Google Inbox and in particular Yahoo Mail handle grouping and smart processing of emails. Inbox does smart clustering that seems to go beyond a simple bag of words. They are also able to extract the most salient facts and present them. While parsing is the traditional approach, the paper below explains how this type of work is scaled up.
  • Talk covers the paper: 
  • Look at the structure and hash it AKA X-Cluster. Within each X-Cluster
    • extract text as x-path - creates tables 
    • some paths will be constants
    • others will be different
  • Use rule extraction
    • dictionary based (names, places, ...) - needs only about a 70% hit rate against a dictionary to annotate
    • output is a regex
  • Rule refinement
    • use classification
    • use xpath previously ignored...
    • Features (light annotations)
      • relative xpath position
      • annotation before/after
      • constant values before/after
      • HTML headers
    • Then we have a contextual ...
  • Evaluation

Beyond A-B testing in the AdTech industry - Uri Goren (Bigabid)

Abstract:

A-B Testing is the default evaluation method used all across the advertising industry.
However, despite the simplicity of A-B testing, it is not a silver bullet suitable for all scenarios.
We would cover several flavours of A-B testing and their applications, and their upsides and downsides. We would introduce a Bayesian model to tackle some of the issues, and cover the "conjugate_prior" pypi module that is authored by the speaker.

A/B testing is a great subject for speakers to demonstrate their level of sophistication. Uri Goren did about as well as I've heard - kudos! He misses some of the big issues but avoids most of the sand pits :-) while covering the terrain.

bio - impressive
All data scientists end up working on CTR!?
mentioned avoiding confounding factors by limiting test scope.

The reason why we can assume a normal distribution - however, we soon see that the distribution is highly skewed.

How long to run before stopping? He explained p-values. Since the data is far from normal, we want to estimate the CTR probability using a Bernoulli distribution.

The auction pricing slide shows that CPC is based on CTR ... (aren't we missing the next bid?)

Any prior might converge eventually, but if there is a conjugate prior it is the best choice. He also introduced a new Python package for Bayesian Monte Carlo simulation of A/B tests (which lets us guesstimate the remaining probability of a win for A or B) using a conjugate prior (as it fits the posterior);
the package matches posteriors with priors :-)
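To make the Beta-Bernoulli idea concrete, here is a minimal Monte Carlo sketch. This is my own illustration in Java, not the API of the conjugate_prior package: with a uniform Beta(1,1) prior, the posterior for a CTR after observing clicks and misses is Beta(clicks + 1, misses + 1), and sampling from both posteriors estimates the probability that one variant beats the other.

import java.util.Random;

/** Sketch: Monte Carlo estimate of P(CTR_B > CTR_A) under Beta(1,1) priors. */
public class AbTestSketch {

    private static final Random RNG = new Random(42);

    // Gamma(k, 1) for integer k: sum of k independent Exp(1) draws.
    static double gammaInt(int k) {
        double sum = 0;
        for (int i = 0; i < k; i++) sum += -Math.log(1 - RNG.nextDouble());
        return sum;
    }

    // Beta(a, b) = X / (X + Y) with X ~ Gamma(a), Y ~ Gamma(b).
    static double betaInt(int a, int b) {
        double x = gammaInt(a), y = gammaInt(b);
        return x / (x + y);
    }

    // Posterior for a Bernoulli CTR with a Beta(1,1) prior is Beta(clicks + 1, misses + 1).
    static double probBBeatsA(int clicksA, int impsA, int clicksB, int impsB, int draws) {
        int wins = 0;
        for (int i = 0; i < draws; i++) {
            double ctrA = betaInt(clicksA + 1, impsA - clicksA + 1);
            double ctrB = betaInt(clicksB + 1, impsB - clicksB + 1);
            if (ctrB > ctrA) wins++;
        }
        return (double) wins / draws;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: variant A 40/1000 clicks, variant B 55/1000 clicks.
        System.out.printf("P(B beats A) ~ %.3f%n", probBBeatsA(40, 1000, 55, 1000, 100_000));
    }
}

The same posterior samples can also drive a Bayesian bandit: allocate more traffic to whichever variant wins more of the sampled draws, which is the money-saving behaviour mentioned in the notes below.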








My Notes

  • combining features requires a factorial design
  • p-value - the chance of getting a result at least this extreme in an A/A test
  • like others before, he talks about sample size; shows 30 is good for a uniform distribution
  • when do we stop - in ad-tech cross validation ? 
  • Stratified cross-validation - did not talk about it
  • Bernoulli is better but ...
  • Bayesian one-armed bandits save you money that would otherwise be lost on the worse branch while the test is running.
  • Asked how do they know the test has run its course and/or validate the results....
Some answers were given.
Recommended tutorial on Bayesian methods for data science - Allen Downey's Think Bayes!

Monday, February 19, 2018

Insight into progressive web apps

Some notes from a Meetup on PWAs in January 2016. I feel quite knowledgeable about PWAs but I wanted to learn more about implementing service workers. I ended up adding some research and collecting some great resources.

I also ended up adding more detailed material on the service worker, based on Google's developer docs, and the resources section has been expanded.





Service worker


A service worker is just a simple JavaScript file that sits between you and the network:
– It runs in another thread
– It has no access to DOM
– It intercepts every network request (including cross domain)

Entry point: self.caches (in service worker) or window.caches (on page)

 

Registering a Service Worker

• Works with promises
• Re-registration works fine

In main.js
navigator.serviceWorker.register('/sw.js').then(function(reg) {
  console.log('registered');
}).catch(function(err) {
  console.log('Boo!');
});

Setting an interception scope

The default scope is where the sw file is, but you can control that

navigator.serviceWorker.register('/sw.js',{scope: '/my-app/'});

It will then control /my-app/ and its subdirectories

On install


  • Add initial resources (the application shell) to the cache
  • Cache has a name
  • Array of resources to cache
  • Mechanism to get a resource by path (a map)

// The install handler below takes an array of urls, fetches them,
// and stores the responses in the cache,
// example: key: 'main.js' value: 'alert(3)'

var cacheName = 'app-shell-cache-v1';
var filesToCache = ['/', '/index.html', ...];
self.addEventListener('install', event => {
 event.waitUntil(
   caches.open(cacheName).then(cache => {
       return cache.addAll(filesToCache);   //load app shell into the cache.
   }).then(() => {
     return self.skipWaiting();
   })
 );
});

The install should happen in the background in case there is a previous version of the service worker running. If the install fails the old service worker will be left intact.


On activate

Update the cache - remove outdated resources. Caches should be versioned. If the sum of all caches for an origin gets too big, they may be reclaimed, so we should make sure to remove old data. This is done more easily if we use versioned caches.

self.addEventListener('activate',e => {
 e.waitUntil(
   caches.keys().then(keyList => {
     return Promise.all(keyList.map(key => {
       if (key !== cacheName) return caches.delete(key);
     }));
   }));
 return self.clients.claim();
});


On fetch

Retrieve from cache with network fallback
Allows us to intercept page loading
Can get page from the cache or from network,
Handle offline and 404 with exception

self.addEventListener('fetch', event => {
 event.respondWith(
   caches.match(event.request)
    .then(response => {
      return response || fetch(event.request); //return cached else fetch
   })
 );
});

How this is handled in practice depends on the resources and their rate of change. So the shell might be fetched cache-first,
while news might be fetched from the network, falling back to the cache if offline.

Serving files from the cache


Cache falling back to network


As above

self.addEventListener('fetch', function(event) {
 event.respondWith(
   caches.match(event.request)
   .then(function(response) {
     return response || fetch(event.request);
   })
 );
});

Network falling back to cache


Frequently updated data with fallback to cache - say for news where we have an older feed.

self.addEventListener('fetch', function(event) {
 event.respondWith(
   fetch(event.request).catch(function() {
     return caches.match(event.request);
   })
 );
});

Cache then network


For resources that update frequently and are not versioned in the shell
E.g. (articles, avatars, social media timelines, game leader boards)
Requires 2 requests - one to cache and one to the network.

Note this code goes in the main script not the SW as it is … reactive

var networkDataReceived = false;
var networkUpdate = fetch('/data.json')
.then(function(response) {
 return response.json();
}).then(function(data) {
 networkDataReceived = true;
 updatePage(data);
});


Next we look for the resource in the cache. This will usually respond faster than the network request, and we use the cached data to provide a quick response. If the network later provides newer data we update the page again. If the cache lookup fails we fall back to the network request.


caches.match('/data.json').then(function(response) {
 return response.json();
}).then(function(data) {
 if (!networkDataReceived) {
   updatePage(data);
 }
}).catch(function() {
 return networkUpdate;
})

Generic fallback


Here is a version with a generic fallback to an offline mode if network fails

self.addEventListener('fetch', function(event) {
 event.respondWith(
   caches.match(event.request).then(function(response) {
     return response || fetch(event.request);
   }).catch(function() {
     return caches.match('/offline.html');
   })
 );
});


Progressive web apps use a manifest to set up an icon on mobile.


In html:

<link rel="manifest" href="/manifest.json">

Sample WebApp Manifest:

{
  "name": "Tovli",
  "short_name": "TovliWeb",
  "start_url": ".",
  "display": "standalone",
  "background_color": "#fff",
  "description": "Feel better today",
  "icons": [{
    "src": "images/homescreen48.png",
    "sizes": "48x48",
    "type": "image/png"
  }]
}

Cache storage limits


Browser                 | Limit    | Notes
------------------------|----------|-----------------------------------
Chrome and Opera        | No limit | Storage is per origin, not per API
Firefox                 | No limit | Prompts after 50 MB
Mobile Safari           | 50 MB    |
Desktop Safari          | No limit | Prompts after 5 MB
Internet Explorer (10+) | 250 MB   | Prompts after 10 MB

The PWA Checklist

• Site is served over HTTPS (localhost permitted)
• Pages are responsive on tablets & mobile devices
• Site works cross-browser
• Each page has a URL
• Page transitions don't feel like they block on the network
• The start URL (at least) loads while offline
• Metadata provided for Add to Home screen
• First load fast even on 3G
• See the full checklist

PWA with Vue


Scaffolding a Vue PWA app:

npm install -g vue-cli
vue init pwa my-project
cd my-project
npm install
npm run dev

Once done, use npm to:

Run the app in development mode:
npm run dev

Build for production (uglify, minify etc):
npm run build

Run unit tests with Karma + Mocha + karma-webpack:
npm run unit

Run end-to-end tests with Nightwatch:
npm run e2e

Resources


GraphQL with Apollo

My notes from Alexey Kureev's talk titled "Apollo Client: the stuff no-one ever told ya" by @klarna,
given at the React & React Native Meetup.

Meetup link:
REST is very widely used, but as web applications have evolved, with most of the processing happening at the client, some of its features are now seen as performance bottlenecks. For example, endpoints are separated, as are the entities, but we typically want to query for data representing some relation between endpoints and slices of the entities. With REST this would require multiple requests and fetching the full entities. GraphQL lets us do this with a single request and provides a more sophisticated way to make queries.

GraphQL is the evolution... Benchmarks published by Facebook claim a ...

The slides show how to consume a GraphQL data source using react.

REST is now vintage - separate endpoints with separate entities.
REST pros:
  • deterministic URIs
  • caching at the HTTP level

GraphQL supports reactive subscriptions and data aggregation that don't really fit the REST architecture very well. With GraphQL we need only one request. As the name implies, GraphQL is a graph query language that changes the way you think of data. Instead of operating on separate entities, you start to operate on data graphs. Let's take a closer look:

Operation pattern

GraphQL provides three operation types:

  1. Query
  2. Mutation
  3. Subscription

op_type Op_Name (op_params) {
  field_1
  field_2 (field_params) {
    subfields
  }
}

Query


query UserQuery ($id: Int!) {
  user (id: $id) {
    name
    email
  }
}

Mutation


mutation UserNameMutation ($id: Int!, $name: String!) {
  editUserName (id: $id, name: $name) {
    name
  }
}

Subscription


subscription UserQuery ($id: Int!) {
  user (id: $id) {
    name
    email
  }
}

Shopping cart example


Shopping Cart Query


query ShoppingCartList {
  products {
    id
    title
    preview
    price
  }
}

Details query

 

query ProductInfo($id: Int!) {
  product(id: $id) {
    id
    title
    preview
    price
    isAvailable
    discountValue
    description
  }
}

We can now utilise reusable fragments


Define


fragment preview on Product {
  title
  preview
  price
}

Define


fragment details on Product {
  isAvailable
  discountValue
  description
}

Use


query ShoppingCartList {
  products {
    id
    ...preview
  }
}

Use


query ProductInfo($id: Int!) {
  product(id: $id) {
    id
    ...preview
    ...details
  }
}

But this is still going to fetch from the web.


Fragments can be cached separately, so they can save transfer bandwidth.

1. readQuery

const { product } = client.readQuery({
  query: gql`
    query Product($id: Int!) {
      product(id: $id) {
        title
        preview
      }
    }
  `,
  variables: { id: 5 }
});
2. readFragment

const product = client.readFragment({
  id: 'Product:5',
  fragment: gql`
    fragment productFragment on Product {
      title
      preview
    }
  `,
});
3. <Query fetchPolicy="cache-only" />

Reads only from the in-memory cache, but will fail if part of the query data is missing.
<Query 
  query={cacheQuery} 
  variables={{ id }}
  fetchPolicy="cache-only"
>
  {({ data = {} }) => <SomeComponent />}
</Query>
for example: Shopping List screen
<Query query={shoppingCartList}>
  {({ loading, data }) =>  (
    <View>
      {loading && <Loading />}
      {!loading && <Products data={data.products} />}
    </View>
  )}
</Query>
But we want to get some data from the local cache and some from the server:

1. Cached data query


query ProductInfoCache($id: Int!) {
  product(id: $id) {
    id
    ...preview
  }
}

And display via


<Query 
  query={productInfoCache} 
  variables={{ id }} 
  fetchPolicy="cache-only"
>
  {({ data = {} }) => // ...
</Query>



2. Missing (details) data query


query ProductInfo($id: Int!, $full: Boolean!) {
  product(id: $id) {
    id
    ...preview @include(if: $full)
    ...details
  }
}


Or Declaratively


<Query 
  query={productInfo}
  variables={{ id, full: !data.product }}
>
  ({ loading, data }) => //...
</Query>
Add cache redirect
const cache = new InMemoryCache({
  cacheRedirects: {
    Query: {
      product: (_, { id }) =>
        toIdValue(cache.config.dataIdFromObject({ 
          __typename: "Product", id
        })
      )
    }
  }
});

TRY IT OUT


Optimistic response … on every mutation or update, the update function is called.

Optimistic UIs don’t wait for an operation to finish to update to the final state.

They immediately switch to the final state, showing placeholder data while the real operation is still in progress.

Remove item from the cached list

mutate({
  variables: { id },
  optimisticResponse: {
    __typename: 'Mutation',
    removeCartItem: {
      __typename: 'Product',
    },
  },
  update: (proxy) => {
    const data = proxy.readQuery({ query: ShoppingCartList });
    const filteredProducts = data.products.filter(product => product.id !== id);

    proxy.writeQuery({ 
      query: ShoppingCartList, 
      data: {...data, products: filteredProducts},
    });
  },
});

P.S. graphics are by shadow.x.q84@gmail.com

Summing up: Alexey Kureev is an amazing storyteller, able to take this complex technical topic
and make it both interesting and exciting. I definitely will be looking into using GraphQL and
Apollo in future projects.

See also: