Friday, November 17, 2017

My first BigQuery DWH

Some notes about a project that has taken up a lot of my time recently.

For a media attribution project for a client with several hundred campaigns, I took up the task of migrating a spreadsheet-based BI to a more robust script- and SQL-based platform.

Data collection

Data is currently acquired daily via API from media platforms such as Google AdWords and Google Analytics.

I got an initial jump start by modifying a sample AdWords script with BigQuery writing capabilities. However, this was a decision I ultimately regretted, and I may replace the script in the second version; more on that later.

First I added support for Google API queries. The first challenge was a mismatch in field names and key formats between datasets.

I added some ETL capabilities to the script to fix these.
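
To illustrate the kind of fix involved, here is a minimal sketch in Python (not the language of the original script). The field names in the alias table, the two date formats, and the helper names are assumptions chosen for illustration; the real mismatches may have been different.

```python
import re
from datetime import datetime

# Hypothetical mapping from source-specific field names to one shared schema.
FIELD_ALIASES = {
    "CampaignName": "campaign",
    "ga:campaign": "campaign",
    "Date": "date",
    "ga:date": "date",
    "Clicks": "clicks",
    "ga:adClicks": "clicks",
}


def normalize_date(value):
    """Return an ISO date for either 'YYYYMMDD' or 'YYYY-MM-DD' input."""
    digits = re.sub(r"\D", "", value)
    return datetime.strptime(digits, "%Y%m%d").date().isoformat()


def normalize_row(row):
    """Rename fields to the shared schema and align the key formats."""
    out = {}
    for name, value in row.items():
        out[FIELD_ALIASES.get(name, name)] = value
    if "date" in out:
        out["date"] = normalize_date(out["date"])
    if "campaign" in out:
        # Assumed join key: campaign names compared case-insensitively.
        out["campaign"] = out["campaign"].strip().lower()
    return out
```

Once rows from both sources pass through the same function, they can be joined on identical keys.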

Next, as working with BigQuery became unwieldy, all the stored SQL queries were migrated into the script once I had added support for view creation. This proved judicious, as it became possible both to maintain many versions of the queries and to drop the whole warehouse and recreate it programmatically at the press of a button.
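
The idea can be sketched roughly as follows, again in Python rather than the original script's language. The view names and queries are invented for illustration. The sketch only generates SQL text using BigQuery's DDL statements (CREATE OR REPLACE VIEW, DROP VIEW IF EXISTS), whereas the original script most likely created views through the API.

```python
# Hypothetical view definitions kept in the script, listed in dependency order.
VIEWS = {
    "daily_spend": (
        "SELECT date, campaign, SUM(cost) AS cost "
        "FROM {dataset}.adwords_raw GROUP BY date, campaign"
    ),
    "daily_sessions": (
        "SELECT date, campaign, SUM(sessions) AS sessions "
        "FROM {dataset}.analytics_raw GROUP BY date, campaign"
    ),
    "attribution": (
        "SELECT s.date, s.campaign, s.cost, v.sessions "
        "FROM {dataset}.daily_spend AS s "
        "JOIN {dataset}.daily_sessions AS v USING (date, campaign)"
    ),
}


def create_statements(dataset, views=VIEWS):
    """One CREATE OR REPLACE VIEW statement per stored query."""
    return [
        f"CREATE OR REPLACE VIEW {dataset}.{name} AS\n{sql.format(dataset=dataset)}"
        for name, sql in views.items()
    ]


def drop_statements(dataset, views=VIEWS):
    """Drop views in reverse order so dependents go before their sources."""
    return [f"DROP VIEW IF EXISTS {dataset}.{name}" for name in reversed(list(views))]


def rebuild_script(dataset):
    """SQL text that tears down and recreates every view in the dataset."""
    return ";\n".join(drop_statements(dataset) + create_statements(dataset)) + ";"
```

Keeping the queries as data in the script is what makes versioning them and rebuilding the warehouse a single call.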

Next I needed support for multiple versions, as we were running tests on dev, staging, and production.
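
One simple way to support several versions is to give each environment its own dataset and to qualify every table name through a single function. This is a sketch under that assumption; the dataset names and helper names are hypothetical and not taken from the project.

```python
# Hypothetical per-environment datasets; the names are illustrative only.
ENVIRONMENTS = {
    "dev": "attribution_dev",
    "staging": "attribution_staging",
    "production": "attribution",
}


def dataset_for(env):
    """Return the dataset that a given environment reads from and writes to."""
    try:
        return ENVIRONMENTS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None


def qualify(env, table):
    """Build the dataset-qualified table name used in generated SQL."""
    return f"{dataset_for(env)}.{table}"
```

With this in place the same queries can run against dev, staging, or production by changing one argument.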

Monday, July 24, 2017

Build the foundation faster

I've recently started to work with foundation.js. My primary goal for work with Foundation is to generate very fast mockups. After completing the first project I've taken a bit of time to learn a bit more.

Here are some insights which were difficult to discover within the documentation:

  • #slack is available but not active enough to get answers
  • Foundation is primarily a Sass or CSS framework. If you need to do anything more than build a mockup it is probably not going to have any code you need.
  • Using multiple document layouts: 
    1. managed by Panini, which is really handlebars.js
    2. can speed up static prototyping.
    3. may help with PWAs using HTTP 2.0
    4. you need to have a {{> body}} handlebar in the template.
    5. you have to add YAML FrontMatter to the page (is this Mr. Jekyll?)
    6. the YAML needs to reference the layout, i.e. layout: file-name-without-ext
    7. if you reference a missing layout you will get errors.
  • Panini is based on handlebars.js, and Handlebars may have you covered if you need a little more than the Panini documentation mentions, as that documentation is not very comprehensive. So getting familiar with Handlebars can accelerate your work if you want to go to a proof of concept or minimum viable product without investing in a framework.
  • Lorem Ipsum and ... are your friends - avoid real data as long as possible, as it will multiply the complexity of your project and slow you down.
  • FrontMatter can help you understand and prototype the data that needs to be merged with the static markup in order to make the site dynamic, i.e.
    • FrontMatter dataLayer is faster than
    • Mocking data say using faker.js which is faster than
    • Building a backend with fake data which is faster than
    • Building the backend with real data
  • Try to leverage partials
    1. by splitting the HTML into its smallest components. This makes sense if your next step is to convert to components.
    2. If you are going to write tests you probably want to work on small isolated units first.
  • You will probably want to use the CLI based setup as it provides a flexible project to work with and allows rapid introduction of "components". Using the framework is covered in some detail at this page.

Monday, August 8, 2016

When one size can't fit all - UI/UX breakpoints

What are UI/UX breakpoints?

Breakpoints in user experience or user interface design are the width specifications at which a layout changes to accommodate different screen sizes across various devices and orientations. Breakpoints are used with column-based designs drawn from a grid system. In the real world, for example, the highly fragmented screen sizes of Android devices imply that a single design's functionality will be challenged to conform to so many different sizes.

Breakpoints essentially simplify the work by grouping many devices together. Within each bucket, a design will have to scale to adjust. But between breakpoints, there may be more radical changes in the interface's functionality.

Infographic for breakpoint from Google's material design site

One best practice of design involves the use of grids. A notable reference is Josef Müller-Brockmann's book "Grid Systems in Graphic Design". This practice has found its way into web design through libraries such as Bootstrap. However, grids are also somewhat implicit in the design of CSS, which includes paddings and margins that stack into grids.

Here are some examples from the layout section of Google's "material design" guide, which is the best reference for breakpoints from a practical point of view:
  1. Adjust margins and gutters in multiples of 8dp to maintain a grid as the layout changes.
  2. Reveal - show extra elements when space increases
    • A side navigation is kept off screen for breakpoints that are too narrow for the content. But for wider layouts, the side nav is placed on screen permanently. This also works in master-detail views.
    • Cards that expand their content on request.
  3. Transform - turn a simple element into a more complex one
    • side navigation to page tabs
    • one-dimensional column list to a two-dimensional grid layout
    • side menus into icons in a toolbar
  4. Divide - split a z-based (layered) UI into the new space
    • a side panel may split into left and right panels
  5. Reflow
    • Centered grids can be reflowed to address changes in screen size.
  6. Expand - elements can expand
    • A full-width grid-based design can expand to take more space
    • dialogs can expand proportionally or in specific increments.
  7. Position - elements may move to better positions as space expands
    • The FAB (floating action button) may move to a more visible location.
Google's guide recommends seven breakpoints at 480, 600, 840, 960, 1280, 1440, and 1600dp, which amounts to eight designs. However, it lists as many as twelve breakpoints.
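
To see why seven breakpoints give eight designs, here is a small Python sketch that maps a width in dp to its range. The function names are my own, and treating a width that exactly equals a breakpoint as belonging to the wider range is an assumption.

```python
from bisect import bisect_right

# The seven recommended breakpoints, in dp.
BREAKPOINTS_DP = [480, 600, 840, 960, 1280, 1440, 1600]


def layout_bucket(width_dp):
    """Return the index (0 to 7) of the range that contains width_dp."""
    return bisect_right(BREAKPOINTS_DP, width_dp)


def bucket_label(width_dp):
    """Return a readable label for the range, e.g. '600-839dp'."""
    i = layout_bucket(width_dp)
    lower = 0 if i == 0 else BREAKPOINTS_DP[i - 1]
    if i == len(BREAKPOINTS_DP):
        return f"{lower}dp+"
    return f"{lower}-{BREAKPOINTS_DP[i] - 1}dp"
```

Seven cut points split the width axis into eight ranges, one per design, and the label is also a convenient value to report to analytics.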

The analytics perspective - and media queries

When measuring the success of a design at converting visitors into customers, one of the best practices is to collect data using media queries at, say, the page level and report the prevalence of the different breakpoints, screen types, and orientations used to interact with the website or app.

If a web property (website or app) has been designed professionally, the responsive layout will adapt to different screen sizes. The breakpoint, aspect ratio, and orientation provide more meaningful segments for capturing online behavior than screen resolution or device model, since they group users into the most meaningful clusters. (It is still possible to drill down from a breakpoint segment and examine specific devices for issues.)

Breakpoint, resolution, and orientation can be captured via auto-tracking and three custom dimensions.

These segments can be used to better understand the success and failure of processes that are inherently caused by poor or serendipitous design choices, and adjustments can be made to fix defects, to add missing breakpoints, or to improve the design incrementally.


px = dp * (dpi / 160), or dp = (width in px * 160) / dpi, where dpi is the screen density
  • % - relative to the enclosing element (a CSS unit).
  • ch - length relative to the width of the "0" character (a CSS unit).
  • em - length relative to the current font size (a CSS unit).
  • ex - length relative to the x-height of the current font, i.e. the height of the lowercase letter x (a CSS unit).
  • px - length in pixels, which are 1/96 of an inch (a CSS unit).
  • in - inches (a CSS and Android unit).
  • mm - millimeters (a CSS and Android unit).
  • pc - pica, which is 12 points (a CSS unit) used in print.
  • pt - points, 1/72 of an inch (a CSS and Android unit).
  • dpi - dots per inch - the screen density of the device.
  • dp or dip - density-independent pixels (relative to a 160 dpi screen), with densities bucketed into 120 (ldpi), 160 (mdpi), 240 (hdpi), 320 (xhdpi), 480 (xxhdpi), and 640 (xxxhdpi) (an Android unit).
  • sp - scale-independent pixels - scaled according to the font size (an Android unit).
  • rem - relative to the font size of the root element, which ignores scalings of intermediate styles (a CSS unit).
  • vh - 1% of the height of the browser's window (a CSS unit).
  • vw - 1% of the width of the browser's window (a CSS unit).
  • vmin - the smaller of 1 vh and 1 vw (a CSS unit).
  • vmax - the larger of 1 vh and 1 vw (a CSS unit).
Of course, these were all initially supported by different versions of each brand of browser.
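
The dp formula can be checked with a few lines of Python. This is a minimal sketch and the function names are my own; the density table repeats the Android buckets from the list above.

```python
# Android density buckets in dpi.
DENSITY_BUCKETS = {
    "ldpi": 120,
    "mdpi": 160,
    "hdpi": 240,
    "xhdpi": 320,
    "xxhdpi": 480,
    "xxxhdpi": 640,
}


def dp_to_px(dp, dpi):
    """px = dp * (dpi / 160)"""
    return dp * dpi / 160


def px_to_dp(px, dpi):
    """dp = (px * 160) / dpi"""
    return px * 160 / dpi
```

For example, a screen that is 1080 px wide at xxhdpi (480 dpi) is 360 dp wide, which is below the first breakpoint of 480dp, and an 8 dp gutter on an xhdpi (320 dpi) screen is drawn with 16 physical pixels.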


Wednesday, July 13, 2016

Wikimania 2016

Wikimania Goes Outdoors
 (please never again)
Thanks to the kind support of Wikimedia Israel, I attended Wikimania 2016 in the alpine village of Esino Lario, located in the Lombardy region of northern Italy. I was also able to attend the Hackathon event. Compared to other Wikimedia conferences which I have attended, this was a challenging event.

The Wikimedia Foundation has undergone a management crisis, with the result that staff have been leaving in droves. For this Wikimania, the number of scholarships for staff was restricted, so attendance by staff developers was much lower, with many development teams not represented. This has had two outcomes: one is that it is unclear what is being developed by the WMF these days, but the upside is that community projects and developers got more emphasis.

The Kiwix Ecosystem

Kiwix - the offline wikipedia reader
Two such projects are Kiwix and the "community wishlist". Kiwix is the brainchild of Emmanuel Engelhart (User:Kelson), and this Wikimania was preceded by a week-long hackathon focused on Kiwix. The Kiwix software is an interesting project in many ways. Its main focus is an offline version of Wikipedia, which might at first seem unnecessary in a highly connected world. It is not a WMF project, though it has gotten some support from the foundation. The Kiwix hackathon included a couple of similar projects which distribute content like OpenStreetMap, Khan Academy, and even TED talks for consumption in locations without internet connectivity. These include schools and colleges in China, India, and Africa. Other types of communities also use offline Wikipedias: some are prisoners who do not have access to online materials, and others are people living under less enlightened regimes which persecute political activists and do not tolerate freedom of speech. Kiwix is translated by Wikipedia's translator community using translatewiki.net, created by Niklas Laxström (User:Nikerabbit).

Medical Wikipedia (Offline)

Yet another related project built on Kiwix is the "Medical Wikipedia" Android app. This is a version of Kiwix for Android which comes bundled with a cherry-picked collection of medical content from Wikipedia. Medical Wikipedia is a curated collection of high-quality materials which are being expanded and translated into more and more languages. The translations are done by Wikipedia volunteers and may often be the only available medical material in the target language, and as such may have a very high impact on a population that has not been reachable due to the double threat of low connectivity with high communication costs and a more troubling language barrier.

Growing focus on medicine

Over the last two years, I have only attended local events organised by my local chapter. During this time I have noticed on the social networks that more and more of my associates have been shifting their focus to medicine. On one or two occasions I even found myself volunteering at editing sessions coordinated and run by Shani Evenstein on medical subjects. This reinforces my belief that long-time Wikimedians will gravitate from their initial areas of interest to more high-impact areas. This Wikimania had a large number of medicine-related talks, and I take this opportunity to highlight these videos:

1. Wikipedia's coverage of medical topics by Lane Raspberry, Fred Totter

This talk underscored a number of interesting points:

  • If "Attention is the new currency" then Wikipedia is the most requested, published, accessed and consulted source of health information in the world.
  • Success, and even the "best model of sharing" for medical information, is not yet understood.
  • Wikipedia is open for partnership with individuals and organizations, so long as they follow Wikipedia's guidelines.
  • The talk looked at research by Heilman JM and West AG, "Wikipedia and medicine".
2. Medical topics by James Heilman

3. Wikimania 2016 - Wikiversity Journal of Medicine by Mikael Haggestrom

4. Wikimania 2016 - Wikipedia Addiction and its Comorbidities by Kritzolina

This talk confirmed a number of things:

  • You got to be nuts to edit this Wikipedia thing :-) and it is definitely habit forming.
  • If people do not respond to policy try appealing at an emotional level.
  • If that does not work - they may not be "sane" at the moment

Further reading:

  1. To read more about Kiwix and its founder look at this blog post: Emmanuel Engelhart, Inventor of Kiwix: the Offline Wikipedia Browser.
  2. Wikiversity Journal of Medicine
  3. Heilman JM, West AG (2015). "Wikipedia and medicine: quantifying readership, editors, and the significance of natural language". J. Med. Internet Res. 17 (3): e62. doi:10.2196/jmir.4069. PMID 25739399.

Saturday, October 24, 2015

What tools and developer resources are available for Polymer?

Data scientists need a platform for disseminating information and ideas. The Polymer 1.0 release is perhaps the biggest news for web development this year, and as this platform from Google matures we get more information on how Google views app and website development. New tools are being released and a whole ecosystem is forming around Polymer and web components.

Tuesday, August 4, 2015

The age of Selenium

Selenium logo

Some quick notes from the 10th Selenium meetup which took place at the IronSource offices.

My interests were primarily in testing websites and the special use case involving "analytics" setups. It turns out that both simple and complicated analytics setups require about six testing sessions. When working with large web sites, analytics setups tend to get broken routinely. While custom alerts can help to detect large disasters, an automated testing tool is invaluable for tracking these types of failures early.

Oxygen IDE

The bleated Selenium IDE
 (Photo credit: Wikipedia)
The meetup started with an unscheduled lightning talk by Nachum Dimer. This, for me, was the most productive talk, quite possibly because it was just 10 minutes and so had no fluff. It covered Oxygen IDE. Oxygen is a new Selenium framework and integrated development environment. Oxygen IDE allows recording on Internet Explorer, Firefox, and Chrome, which is a significant improvement considering that Selenium IDE, which has not been updated for a long time, only supports recording in Firefox. Besides being an IDE, it has a command-line interface as well as an API. Another advantage is that it records to JavaScript, which means that the workflow can be 100% JavaScript and browser based. When a test fails, it allows automated capture of the UI image for later inspection. The reports are in XML, which is ideal for a CI system like Jenkins. Oxygen is also able to interface with RESTful web services and with databases, allowing for a very complete testing solution in a single package. Needless to say, I've started working with Oxygen!

Ruby based testing stack at MyHeritage

In the second talk, Matan Goren and Yehuda Miller discussed how MyHeritage uses Watir, Ruby, Cucumber and other supporting technologies to automate end-to-end testing at scale on a high-volume website. Goren explained how QA automation engineers are an integral part of the Continuous Deployment process. His slide deck is online. The company has grown significantly in the last year and reached the limit of what it could do with semi-automated tests. The technology used for testing is:

  • Ruby
  • Cucumber 
  • The PageObject Gem
  • Grid Grouper (legacy grid component)
  • Selenium Grid and Grid Extras 
  • Jenkins for CI
Cucumber and cross section
 (Photo credit: Wikipedia)
The highlights are Watir and the PageObject gem, which reduce the setup and overhead of working directly with Selenium. These pay greater dividends when used as primitives which are accessed in plain English using Cucumber, the behavior-driven development tool of choice. Grids are then used to farm out tests to multiple browsers simultaneously, both for production and for staging, and Jenkins does continuous integration. By using these tools they are able both to get QA people with a background in manual testing to automate tests and to involve more developers in coding tests.

Robot Framework at IronSource

NXT Robot
The Robot Framework
 (Photo credit: Wikipedia)
The final talk was rather long, and each speaker ended up repeating much of what the previous ones had covered. The speakers were Naphtali Gilead, Ben Weisblatt, and Ariel Levy from the InstallCore team at IronSource. They covered how they use the open source Robot Framework and how it fits into IronSource's continuous integration life cycle. Their technology stack is more Python based. InstallCore is a monetized installer which generates revenue by offering people additional products. The testing involves complex advertising and installations. The Robot Framework is a highly flexible test automation framework that relies on keywords implemented in Python. Its advantages are its mature libraries, informative reports, and ease of use for QA engineers. The talk also touched on how it complements Selenium as a full end-to-end automation solution. The keywords allow layering the testing code into a number of levels. The top level, with the high-level keywords, reads like English, while the low-level keywords are implemented in Python and are much closer to the actual SUT (system under test). All in all, the keyword solution seems very similar to Cucumber. One insight from this talk was that tests should not be fragile, but the developers also explained that they run all tests three times since their environment itself is rather fragile. This was a bit difficult for some people to get their heads around. If you are interested in the Robot Framework, have a look at this slide deck. And if you want more information on how it is used at IronSource, here is a blog post covering this in more detail.


While some talks are invariably TLDR, the level of the Selenium Meetups is top notch, and they are a great opportunity to learn from some of the top companies and to connect with the community, since most of the Selenium testing stacks are open source. Long live Oxygen!

Thursday, May 28, 2015

Organizational challenges for the cult(ure) of analytics

All things data 2015

Last month I had the fortune to attend the "All things data 2015" conference in Tel-Aviv. The turnout was impressive, and despite a stiff entrance fee almost a thousand people came to learn from a lineup of local and international "experts" on Web Analytics. The crowd was a mix of consultants, sophisticated digital media buyers, analytics instructors and numerous web analysts from newly established startups and from more established companies, all unified by an appreciation of their vested interest in monitoring their web presence.

Organizational challenges

While structural variations are not uncommon,
the ability to derive decisions from data remains
a major challenge for firms.
One of the first talks was given by Mr. Zvika Jerbi, who is a consultant working for SWC. Coming from an academic background, he lectured the audience on the organizational challenges of implementing a culture of analytics in more traditional business settings. The beginning of the talk hinged on defining what a data-driven company is. I am familiar with a number of companies which have embraced the data-driven approach to decision making, but they are few and far between. The data-driven approach rarely takes off due to a conflict with the existing organizational culture.

The data driven company

What is a data-driven company anyhow?
So what is a data-driven company? There are a number of factors that determine whether a company should be considered one. First and foremost is the availability of access to the company's data. A second requirement is that it should have well-established goals, especially for its customer-facing presence. This by itself is a requirement for any modern business, but what really needs to be asked in this context is to what degree the goals are defined in a way that outcomes can be quantified using well-defined KPIs. A data-driven company tracks usage reports for any significant changes that it launches, yet reporting is also a characteristic of most large companies. Again, in this case the focus is to foster a spirit of experimentation. So while more ideas will get to see the light of day, the mediocre ones will be weeded out that much faster. There is greater focus on performance than on features. Curiosity and good storytelling are rewarded. Such companies routinely conduct market research and are able to reverse engineer their customers using demographics and more business-centric segments, and they use behavioral techniques and tools to expand the voice of their customers.

And where will data drive it ?
Smaller companies lack the resources to implement an analytics plan or to take action based on its results. Initially, there is too little data, and its quality is too poor, to justify the investment or to drive decisions. But all too soon there is more data than can be handled, and they may be looking at big data, which brings great complexity to analyzing it. Decision makers who are most likely to ask for data are low to mid-level. Their questions beg timely answers. The big decision makers are higher up in the organizational hierarchy and accordingly further away from the data. They need and usually get highly processed summaries of real data or long-term information that has been vetted by middle and upper managers.

Enterprise entanglement

In enterprise settings agility
is the primary challenge
Next we learned that at the enterprise end of the spectrum, data sets are big. They need ETL systems to process them and private clusters or a public cloud of machines to store and process the data. But once the data is transformed into action or intelligence, it translates into a competitive advantage over smaller competitors. One advantage of smaller companies is agility. Enterprise settings tend to create a lot of "red tape", and while everyone has clearly defined responsibilities, these rarely include helping someone else get things done or looking at data they did not create or process. Even when backed by evidence, few decisions are final, since anyone may be overridden by the system of checks and balances. Here is a typical scenario.

"Mike, a mid-level manager, sees an opportunity in the marketplace. To nail it, he asks his market research department for the impact of creatives on different areas. Annie, an analyst working on product development, has done first-rate work before and is asked to figure out colors for the product and images for the creatives. The data needed is warehoused in an insular unit called IT, which is run like Alcatraz by Isaac the 'IT guy'. IT and marketing operate on different time scales, and anyhow the raw data they produce is useless for Annie. She needs it to be cleaned up, serial numbers converted to labels and so on, a task of which Calvin from the data warehouse is quite capable. He transforms the raw data and combines it with other data sources.

Annie needs to explore it for a day or two before she can come up with a model that can actually do better than a coin toss. Then she needs another slice of data, and a model starts to make some interesting suggestions. When she meets with Mike, she finds that the marketing decisions have already been made, since a competitor announced its entry into the market. Also, the suggestions made by the model are the exact opposite of what was decided by Mike's superior. So by the time the results arrive, weeks have passed and the fleeting opportunity has passed away."

In the majority of business cases, the gap between the data and the people who need it the most, or who could utilize it, is difficult to traverse, and the time required to get new queries processed is long. Decision makers are clustered at the head of the hierarchy, while IT staff and data analysts are rarely capable of making business decisions. There is considerable overhead associated with getting information out of large data warehouses, cleaning it, processing it, using it to make models, and then using these in the production of data-driven services. A data scientist at Facebook described a situation where a 4-month cycle was required to get at information needed to design new user interface features. Larger clients are more frequently interested in implementing in-depth reporting rather than in an analytics service.

The main lesson for introducing analytics in enterprise settings is to start small, which is about reducing risks, costs, overhead and, most important, time to action.

Starting up small

At the other extreme are small and medium businesses, which are leaner in terms of both personnel and their business process.

"Alice is a web designer. She works closely with Will, a web developer living abroad. All the data is either in the cloud-based website (in the log files), in a cloud-hosted database, or in the cloud-based CRM solution. There is no data warehouse yet, and the first point of attack is Google Analytics. Paul is the VP of sales, and PPC work falls under his domain. He has never had time to study AdWords or GA, and so their setup is like that of 99% of users: it is set up for the most basic reporting, and there is nothing in the works to adapt it to their business. No custom segments or business-specific data are in use. So analytics is used to track two landing pages in a supporting role for PPC. While Alice, Will, and Paul can get the raw data, converting it into evidence for making a decision is quite demanding, and at the end of the day they are not paid to analyze data. Falling behind on their daily tasks has its consequences. So all decisions are made by Harris, their CEO. Harris is very hands-on, and when it comes to the website he constantly tries to 'borrow' ideas from the competition, from Google, or from Yahoo. When presented with a choice he will listen to any opinions and eventually end up taking a look at what the competition is doing and take that line."

Data, even if it makes an appearance, will be challenged, and when a HiPPO weighs in, you don't want to get in his way. Collecting statistically significant amounts of data is not a simple undertaking. Implementing an analytics solution has a significant cost, primarily in acquiring qualified manpower. The event horizon for reaching break-even for such an undertaking is often further away than a small business is able to consider.

How to promote analytics in these settings?

The lessons are: Get into the habit of looking at your data daily. Use some insights to get more eyes looking at the data periodically. Once reporting is in place, prioritize analytics tasks which can lead the company to take action, e.g. increasing brand uplift, gaining new users, reducing churn, or optimizing resources. Next, go after any low-hanging fruit: hypotheses that are easy to test and metrics that are actionable. Once you learn to deliver analysis with an action plan on a steady basis, you not only foster trust in a data-driven approach but also create some hunger for it. To further expand the sphere of influence, automate the process of A/B tests for continual improvement, or conduct other randomized tests of new ideas. Deliver these results early in the morning as a morning coffee blurb. Finally, you should take presenting your results to a higher level by delivering them as "data stories", together with infographics and concise visualizations.