Showing posts from December, 2017

Downloading folders from Google Drive

I wanted to download some course material on RL, shared by the author via Google Drive, using the command line. I got a bunch of stuff using wget, but downloading a whole folder from Google Drive was a challenge. I looked it up on SO, which gave me a hint but no solution. I installed gdown using pip and then used: gdown --folder --continue. If there are more than 50 files you need to add --remaining-ok, and you will only get the first 50. In such a case it's best to download the folder using the UI and decompress it locally. Decompressing from the command line produced Unicode-related errors, but using the Mac UI I decompressed it without a glitch.
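The gdown invocation described above can be sketched as a small shell function. The function name download_drive_folder, the DRY_RUN switch, and the placeholder folder URL are all assumptions made for illustration; only the gdown flags come from the post (plus gdown's -O output option).

```shell
# Hypothetical helper: download a shared Google Drive folder with gdown.
# Assumes gdown is already installed (pip install gdown).
download_drive_folder() {
    url="$1"           # link to the shared folder
    dest="${2:-.}"     # target directory, defaults to the current one
    # --folder        treat the URL as a folder rather than a single file
    # --continue      resume partially downloaded files
    # --remaining-ok  accept that only the first 50 files are fetched
    # When DRY_RUN is set, the command is printed instead of executed.
    ${DRY_RUN:+echo} gdown --folder --continue --remaining-ok "$url" -O "$dest"
}

# Usage (FOLDER_ID is a placeholder, not a real folder):
#   download_drive_folder "https://drive.google.com/drive/folders/FOLDER_ID" rl-course
```

Running it with DRY_RUN=1 first is a cheap way to check the command line before anything is downloaded.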

Automating web app development with Polymer and Yeoman

Yeoman lets you configure and stamp out sophisticated boilerplate projects from the command line. In a digital marketing agency, the data science team may be asked to provide fairly similar media reports and explanatory analytics dashboards for each client's campaign, for both external and internal clients. For longer-term projects we may also be asked to provide predictive analytics. The data comes from advertisers (Google, Facebook, Taboola, OutBrain) and a phone tracking metrics API; outcomes are channeled via Segment into, say, Google Analytics, which has both an API and Polymer components. Usually there will be additional data science products (predictions, special segment data, funnels, market research, attribution charts) and long-term data in BigQuery, which has an API as well. Some vendors don't have an API, so to access their data it is exported into a CSV and placed into Google Sheets, which has an API and a Polymer component phone tracking. It takes too much work

Serverless Big Data

I was at the first meeting of the Big Data Analytics meetup. The first speaker was Avi Zloof, CEO of EvaluateX, who gave a talk titled "Serverless Big Data The Good, and the Great". EvaluateX, which is located at "The Junction" (Rothschild 9, Tel Aviv), is an outfit that has a Chrome plugin which can optimize Google BigQuery SQL queries in the web interface. My last BigQuery project, however, had abandoned the web interface and switched to 100% automation via the API. Also, despite having massive queries, there was little need to optimize them. I had been more concerned with comparing different editions of the projects to detect data discrepancies. The connection between Big Data and the GUI is often the primary challenge; however, this was not the subject of the talk. The talk introduced me to EvaluateX and their activity. Mr. Zloof shared many interesting professional insights as well as his point of view regarding serverless database platfor

How to kill by name from the command line - Ubuntu 17.10

Ubuntu Tip

The pain: So I am working on coding a React and Redux component and I have a tight loop spinning in Chrome. Chrome becomes unresponsive and won't stop. Soon it will eat up all the system memory and cause my machine to grind to a halt. For some reason Chrome rarely detects the rapid resource growth. I used to open a terminal and run $ ps -A to look up Chrome's pid, but Chrome has many pids, one for each window and one per extension. My machine is slowing. I next try: $ ps -A | grep chrome which is better. I choose the first pid (I might have to scroll) and run $ kill -9 <pid> and things go back to normal. But I still haven't fixed the bug, and there has to be a better way...

The remedy: $ killall -9 chrome kills all Chrome processes with one command and no lookups, copy-pastes, etc.

Note: Probably nothing specific to Ubuntu 17.10 here ....
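A slightly gentler variant of the same idea can be sketched as a shell function that asks the processes to exit before forcing them. The name kill_by_name is a hypothetical helper, and the sketch assumes pgrep and pkill (from the procps package) are installed, which is the case on a stock Ubuntu desktop.

```shell
# Hypothetical helper: kill every process with an exact name.
kill_by_name() {
    name="$1"
    # pgrep -x matches the process name exactly, not as a substring.
    if ! pgrep -x "$name" >/dev/null; then
        echo "no process named $name" >&2
        return 1
    fi
    pkill -x "$name"            # polite request first (SIGTERM)
    sleep 2                     # give the processes a moment to exit
    if pgrep -x "$name" >/dev/null; then
        pkill -KILL -x "$name"  # force-kill whatever is still running
    fi
    return 0
}

# Usage:
#   kill_by_name chrome
```

Sending SIGTERM first gives the browser a chance to shut down cleanly; the SIGKILL fallback behaves like killall -9 for anything that ignores the request.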

Simpla goes open source

Simpla, the headless content management system, has recently announced that they are closing down and making their project open source. This project allows a developer to rapidly prototype a website and editors to manage the content from the page's UI itself. The big change is that you will now be able to move your content from the Simpla database and host it on GitHub. Headless means that the CMS doesn't have a huge front end like WordPress to manage the code. Instead, its backend is exposed as a simple API, allowing developers to use whatever integration is best suited for each user story. Headless CMSs are more suitable for working with multiple channels, such as Android and iOS apps alongside a website. Trying to set up a new project using Simpla is easier said than done. Once I'm up and running I'll add some more updates in this space.

Building an OCR using NLI Ephemera and manuscripts

The OCR task can be broken down as follows:

1. Acquire the image.
2. Segment it into regions according to the following labels: Image, Text Areas (with optional rotation), Tabular Data (with optional rotation).
3. Scale down very large text to suitably sized glyphs.
4. Improve results by adding terms to better model the noise extant on the page.
5. Improve results by incorporating lexical and grammatical knowledge into the classifier.

Ideally all this should be done by an end-to-end system.

A common complex layout of Hebrew sacred texts, with non-rectangular columns containing related but independent sequences.

Once text areas are detected, a page segmentation algorithm is required to break them down into lines and glyphs. Looking at some samples from the NLI ephemera database, one would wish to add steps to clean up and rescale some elements whose fonts are too small. Also, if one had a suitable model, perhaps add details to text that is too small. The challenges are: Training on a large data

