Posts

Showing posts from 2017

Downloading folders from Google Drive

I wanted to download some course material on RL that the author had shared via Google Drive, using the command line. I got a bunch of the files using wget, but downloading a folder from Google Drive was a challenge. I looked it up on Stack Overflow, which gave me a hint but no solution. I installed gdown using pip and then ran: gdown --folder --continue https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV If the folder contains more than 50 files you need to pass --remaining-ok, and even then you only get the first 50. In that case it is best to download the folder using the web UI and decompress it locally. Decompressing from the command line produced Unicode-related errors, but using the Mac UI I decompressed it without a glitch.
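Putting it together, the whole sequence looks like this (a minimal sketch; the folder URL is the one from the post, and the 50-file fallback is gdown's documented behaviour):

$ pip install gdown
# download a shared folder, resuming any partial downloads
$ gdown --folder --continue https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV
# for folders with more than 50 files, gdown refuses to run unless
# told it is OK to fetch only the first 50
$ gdown --folder --remaining-ok https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV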

Automating web app development with Polymer and Yeoman

Yeoman lets you configure and stamp out sophisticated boilerplate projects from the command line. In a digital marketing agency, the data science team may be asked to provide, for each client's campaign, fairly similar media reports and explanatory analytics dashboards for both external and internal clients. For longer-term projects we may also be asked to provide predictive analytics. The data comes from advertisers (Google, Facebook, Taboola, Outbrain) and phone tracking metrics APIs; outcomes are channeled via Segment into, say, Google Analytics, which has both an API and Polymer components. Usually there will be additional data science products (predictions, special segment data, funnels, market research, attribution charts) and long-term data in BigQuery, which has an API as well. Some vendors don't have an API, so to access their data it is exported to a CSV and placed into Google Sheets, which has an API and a Polymer component. It takes too much work…
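To make the Yeoman part concrete, this is roughly how a generator is installed and used to stamp out a Polymer project (a minimal sketch, assuming the now-dated generator-polymer generator; the project name is made up, and any generator works the same way):

$ npm install -g yo                  # the Yeoman runner
$ npm install -g generator-polymer   # a Polymer project generator
$ mkdir client-dashboard && cd client-dashboard
# answer the prompts and Yeoman stamps out the boilerplate
$ yo polymer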

Serverless Big Data

I was at the first meeting of the Big Data Analytics meetup. The first speaker was Avi Zloof, CEO of EvaluateX, who gave a talk titled "Serverless Big Data - The Good, and the Great". EvaluateX, located at "The Junction" (Rothschild 9, Tel Aviv), is an outfit with a Chrome plugin that can optimize Google BigQuery SQL queries in the web interface. My last BigQuery project, however, had abandoned the web interface and switched to 100% automation via the API. Also, despite having massive queries, there was little need to optimize them; I had been more concerned with comparing different editions of the project to detect data discrepancies. The Big Data and GUI connection is often the primary challenge, but this was not the subject of the talk. The talk introduced me to EvaluateX and their activity. Mr. Zloof shared many interesting professional insights as well as his point of view regarding serverless database platforms…
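To make "automation via the API" concrete: rather than pasting SQL into the console, the queries run from scripts. A minimal sketch with Google's bq command-line tool (the dataset and table names here are invented for illustration):

# run a standard-SQL query non-interactively, no web UI involved
$ bq query --use_legacy_sql=false \
    'SELECT campaign, SUM(clicks) AS clicks
     FROM `media.daily_stats`
     GROUP BY campaign'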

How to kill by name from the command line - Ubuntu 17.10

Ubuntu Tip

The pain: I am working on a React and Redux component and I have a tight loop spinning in Chrome. Chrome becomes unresponsive and won't stop. Soon it will eat up all the system memory and cause my machine to grind to a halt. For some reason Chrome rarely detects the rapid resource growth. I used to open a terminal and run $ ps -A to look up Chrome's pid, but Chrome has many pids: one for each window and one per extension. My machine is slowing. I next try: $ ps -A | grep chrome This is better; I choose the first pid (I might have to scroll) and run $ kill -9 <pid> and things go back to normal. But I still haven't fixed the bug, and there has to be a better way...

The remedy: $ killall -9 chrome kills all Chrome processes in one command, with no lookups or copy-pastes.

Note: probably nothing Ubuntu 17.10-specific here...
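The before-and-after, side by side (pkill, which matches by name pattern, is an equivalent alternative to killall):

# the old way: find one of Chrome's many pids, then kill it
$ ps -A | grep chrome
$ kill -9 <pid>

# the remedy: kill every process named chrome at once
$ killall -9 chrome
# or, matching the process name by pattern
$ pkill -9 chrome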

Simpla goes open source

Simpla, the headless content management system, has recently announced that they are closing down and making their project open source. This project allows a developer to rapidly prototype a website, and editors to manage the content from the page's UI itself. The big change is that you will now be able to move your content out of the Simpla database and host it on GitHub. Headless means that the CMS doesn't have a huge front end, like WordPress, to manage the code; instead, its backend is exposed as a simple API, allowing developers to use whatever integration is best suited for each user story. Headless CMSs are more suitable for working with multiple channels, such as an Android or iOS app alongside a website. Trying to set up a new project using Simpla is easier said than done. Once I'm up and running I'll add some more updates in this space.

Building an OCR using NLI ephemera and manuscripts

The OCR task can be broken down as follows:
1. Acquire the image.
2. Segment it into regions according to the following labels: image, text areas (with optional rotation), tabular data (with optional rotation).
3. Scale down very large text to suitably sized glyphs.
4. Improve results by adding terms that better model the noise extant on the page.
5. Improve results by incorporating lexical and grammatical knowledge into the classifier.
Ideally all this should be done by an end-to-end system.

(Figure: a common complex layout of Hebrew sacred texts, with non-rectangular columns of related but independent text sequences.)

Once text areas are detected, a page segmentation algorithm is required to break them down into lines and glyphs. Looking at some samples from the NLI ephemera database, one would wish to add steps to clean up and rescale elements whose fonts are too small. Also, given a suitable model, one could perhaps add detail to text that is too small. The challenges are: training on a large data…
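For a baseline to compare such a pipeline against, an off-the-shelf pass might look like this (a hedged sketch using the Tesseract CLI; the file name is made up, and Tesseract's automatic page segmentation only approximates the layout analysis described above):

# orientation and script detection only (writes ephemera.osd)
$ tesseract ephemera-scan.png ephemera --psm 0
# fully automatic page segmentation + OCR with the Hebrew model
$ tesseract ephemera-scan.png out -l heb --psm 3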

The Happy winners #nlihack 2017

About the event

(Photo: the happy winners! Amir Aharoni, Oren Bochman and Chaim Cohen.)

Last week (November 23-24, 2017) I had the pleasure of participating in the first National Library of Israel hackathon. I've been to the NLI a few times with friends from the Wikimedia movement to instruct its staff and students in editing Wikipedia. But at the hackathon, the NLI opened its doors to the best and brightest minds to help out with tagging content and disseminating its extensive image database.

The team: You can't win a hackathon without a great team. My team consisted of seven developers who have been part of the core community of Wikimedia developers in Israel and have been meeting irregularly since the international Wikimedia hackathon organized by Wikimedia Israel last year. We had met about a week before the event at the local chapter's offices and discussed over pizza what we wanted to do and what the NLI had asked us to do. I realized that…

My first BigQuery DWH

BigQuery is Google's analytics database. Some notes about a project that has taken up a lot of time recently: a media attribution dashboard for a client running several hundred campaigns. A POC version of the project had been created manually using spreadsheets, and we had to provide a drop-in replacement ASAP. I took up the task of migrating a spreadsheet-based BI to a more robust script-and-SQL-based platform able to handle the rapidly accumulating data, which would soon overpower the spreadsheet's models. A secondary challenge was that the entire system we were analyzing was under development and would change daily. Despite its shortcomings as a classical database (missing triggers and in-schema protections), I chose BigQuery for its scalability and ease of integration. Despite its limitations, it soon felt like a perfect fit for this type of project.

Data collection: Data is currently acquired daily via API from various platforms, for example Google AdWords and…
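As an illustration of the daily-ingest shape (a hedged sketch; the CSV file stands in for whichever platform export is being pulled, and the dataset, table, and column names are invented):

# append today's export to a date-sharded staging table
$ bq load --source_format=CSV --skip_leading_rows=1 \
    media.adwords_stats_$(date +%Y%m%d) today_export.csv \
    campaign:STRING,clicks:INTEGER,cost:FLOAT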

How to take out the Trash from command line in Ubuntu 17.10

How to use the trash from the command line?

The pain: Setting up new projects is frequently time consuming, with many false starts until everything is set up right. In fact, once CI is set up, the version on the local machine is less important. I've been encountering this Ubuntu annoyance whenever starting a new project. I could create smart aliases for rm with a command-line trash folder, but then there would be two trash folders. I just want to access the same trash folder from the command line that I can access through the desktop. It also turns out that this has been the subject of not one, not two, but at least three packages. The following option is quick, safe (as it is reversible), and lets us focus on the setup. Doing machine learning also creates big models and large downloaded datasets that can hog the limited fast storage. Still, I don't enjoy retraining a big model because I accidentally tossed out the last good model along with all the previous runs.

The remedy: …
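The teaser cuts off before naming the remedy, but a likely candidate is the trash-cli package, which uses the same FreeDesktop.org trash as the desktop (a hedged sketch, assuming trash-cli is the package the post settles on; the file name is made up):

$ sudo apt install trash-cli
# send files to the desktop's trash instead of deleting them
$ trash-put old-model.ckpt
# inspect and manage the shared trash
$ trash-list
$ trash-restore
$ trash-empty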

Lay the foundation faster

I've recently started to work with Foundation. My primary goal in working with Foundation is to generate mockups very fast. After completing the first project I've taken a bit of time to learn a bit more. Here are some insights which were difficult to discover within the documentation:
The #slack channel is available but not active enough to get answers.
Foundation is primarily a Sass/CSS framework. If you need to do anything more than build a mockup, it is probably not going to have the code you need.
Using multiple document layouts, managed by Panini (which is really handlebars.js), can speed up static prototyping, and may help with PWAs using HTTP 2.0.
You need to have a {{> body}} handlebars partial in the template.
You have to add YAML front matter to the page (is this Mr. Jekyll?).
The YAML needs to reference the layout, i.e.: layout: file-name-without-ext
If you reference a missing layout you will get errors.
Panini is based on handlebars.js, and handlebars may have you…
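To illustrate the layout and front-matter dance described above (a minimal sketch, assuming the ZURB template's default src/layouts and src/pages directory convention; the file names are made up):

# a layout must pull the page body in via the body partial
$ cat > src/layouts/plain.html <<'EOF'
<html>
  <body>
    {{> body}}
  </body>
</html>
EOF

# each page names its layout (without the extension) in YAML front matter
$ cat > src/pages/index.html <<'EOF'
---
layout: plain
---
<h1>Fast mockup</h1>
EOF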
