Downloading folders from Google Drive

I wanted to download some course material on RL that the author had shared via Google Drive, using the command line. I got most of it with wget, but a folder on Google Drive was a challenge. I looked it up on Stack Overflow, which gave me a hint but no solution. I installed gdown using pip and then ran `gdown --folder --continue`. If the folder holds more than 50 files you need to add `--remaining-ok`, and you will still only get the first 50. In such a case it is best to download the folder via the UI and decompress it locally. Decompressing from the command line produced Unicode-related errors, but using the macOS UI I decompressed it without a glitch.
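The steps above can be sketched as follows. This is a minimal recipe based on the flags mentioned in the post; the folder URL is a placeholder, not the actual course material link.

```shell
# Install gdown via pip.
pip install gdown

# Download a shared folder; --continue resumes partially downloaded files.
# (FOLDER_ID below is a hypothetical placeholder.)
gdown --folder --continue "https://drive.google.com/drive/folders/FOLDER_ID"

# Folders with more than 50 files abort unless you pass --remaining-ok,
# and even then only the first 50 files are fetched.
gdown --folder --continue --remaining-ok "https://drive.google.com/drive/folders/FOLDER_ID"
```

For larger folders, downloading the zip archive from the Drive web UI and decompressing locally avoids the 50-file limit entirely.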

Serverless Big Data

I was at the first meeting of the Big Data Analytics meetup. The first speaker was Avi Zloof, CEO of EvaluateX, who gave a talk titled "Serverless Big Data: The Good, and the Great".

The Speaker Avi Zloof
EvaluateX, located at "The Junction" (Rothschild 9, Tel Aviv), is an outfit with a Chrome plugin that can optimize Google BigQuery SQL queries in the web interface. My last BigQuery project, however, had abandoned the web interface and switched to 100% automation via the API. Also, despite having massive queries, there was little need to optimize them; I had been more concerned with comparing different editions of the project to detect data discrepancies. The connection between Big Data and the GUI is often the primary challenge; however, this was not the subject of the talk.
The talk introduced me to EvaluateX and their activity. Mr. Zloof shared many interesting professional insights as well as his point of view regarding serverless database platforms. He briefly outlined the history of serverless databases, mentioning:

  • Google BigQuery
  • AWS Athena
  • Azure Functions
  • Google Cloud Functions
  • IBM OpenWhisk

Mr. Zloof's primary takeaway message was that the pricing model is the key to correctly evaluating a platform's suitability for a company's business model.

The Serverless Database Platforms for 2018
Next came a back-of-the-napkin calculation which posited that if processing a TB query costs 5 USD on BigQuery and creates a value of 6 USD, you have a viable business model for working with big data. I felt that ignoring storage and networking costs might be a flaw in this rough model. However, I cannot deny that reducing the complexities of pricing cloud services is an easier sell to middle and upper management than the labyrinthine calculations needed to price real-world cloud services behind BI systems, and that this approach distills the costs of processing to their essence.
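The napkin math works out as follows. This is a sketch using the talk's numbers (5 USD cost, 6 USD value per TB); the monthly volume is a made-up figure for illustration.

```shell
# Back-of-the-napkin viability check, per the talk's assumptions.
cost_per_tb=5     # USD to process one TB on BigQuery
value_per_tb=6    # USD of business value created per TB processed
tb_scanned=100    # hypothetical monthly query volume in TB

cost=$((cost_per_tb * tb_scanned))
value=$((value_per_tb * tb_scanned))
echo "cost=${cost} USD value=${value} USD margin=$((value - cost)) USD"
# prints: cost=500 USD value=600 USD margin=100 USD
```

As long as value per TB exceeds cost per TB, the margin stays positive at any volume, which is exactly the simplicity that makes the model an easy sell.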

Mr. Zloof noted that when Oracle CTO Larry Ellison announced his company's entry into the serverless space with Fn in October 2017, such minutiae as the pricing plan were glossed over. Once the service's pricing is finalised, it will become the deciding factor for evaluating to what extent Oracle's new platform will be competitive in this crowded space.

Pricing as of 2018: Google BigQuery costs 5 USD per TB. AWS Athena is priced at 5 USD per compressed TB, which can cost a third less than BigQuery. Another insight was that in terms of scaling and performance, although Athena is cheaper than BigQuery, it is far less powerful. DynamoDB is a key-value store, which is not as suitable for analytics or general-purpose work as, say, a SQL backend.

Mr. Zloof stated that NoSQL solutions are adding SQL interfaces, and that after many years he now feels that SQL is the lingua franca of Big Data systems. I haven't seen this in, say, Firebase; however, superlatives aside, this is definitely a trend in the evolution of NoSQL systems. Google BigQuery, for example, started out with a proprietary SQL dialect and now supports a more standards-compliant SQL. But when a NoSQL database adds a SQL front end, it is highly unlikely to be as performant as a native SQL backend, which is where query optimization becomes important.

Another important feature not available from most serverless databases is a "stop loss" ability. The term, which comes from the finance industry, refers to a command that halts an operation before it runs up very large charges. Most queries need a single scan of the data warehouse, but if your operation is polynomial it could run for days over your data, and you might not be aware of it, or able to halt it, without intervention from support, which could take hours to respond. So a stop-loss capability can be a game changer.
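BigQuery does offer something close to a stop loss: the bq CLI can cap how many bytes a query is allowed to bill, failing the query up front instead of running up a charge. A minimal sketch, assuming an authenticated bq CLI; the project, dataset, and table names are hypothetical.

```shell
# Cap a query at ~1 GB billed; BigQuery rejects the query if it would
# scan more, instead of charging for the full scan.
bq query --use_legacy_sql=false \
  --maximum_bytes_billed=1000000000 \
  'SELECT COUNT(*) FROM `my_project.my_dataset.big_table`'
```

It is a per-query cap rather than a kill switch for a running operation, but it achieves the same goal of bounding the worst-case charge before the query starts.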

All in all this was a great talk and I hope to be hearing more from this speaker in the future.



