This blog is where I share insights from my various adventures in data analysis.
I will cover best practices from modeling and analysis projects; share tips on using new tools; outline new projects; recount war stories from Wikipedia and other FOSS projects I contribute to; and discuss challenges in information retrieval, natural language processing tricks, game-theoretic insights, portfolio analysis, and social network analysis.
I wanted to download, from the command line, some course material on RL that the author shared via Google Drive. I got a bunch of files using wget, but downloading a whole Google Drive folder was a challenge. I looked it up on SO, which gave me a hint but no solution. I installed gdown using pip and then ran: gdown --folder --continue https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV. Note that if the folder contains more than 50 files you need to add --remaining-ok, and you will still only get the first 50. In that case it's best to download the folder via the UI and decompress it locally. Decompressing from the command line produced Unicode-related errors, but the macOS UI decompressed it without a glitch.
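If you prefer to script this, gdown also has a Python API; here is a minimal sketch, assuming a recent gdown version where download_folder and remaining_ok exist (the output folder name is made up):

```python
import gdown

# Download an entire shared Google Drive folder.
# remaining_ok=True lets the call proceed even when the folder
# holds more than 50 files (gdown still fetches only the first 50).
files = gdown.download_folder(
    url="https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV",
    output="rl_course_material",  # hypothetical output directory
    quiet=False,
    remaining_ok=True,
)
print(f"downloaded {len(files)} files")
```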
SQL Dojo
TLDR:
Imagine it's just before your DS interview: you are Neo, your coach is Morpheus, and you will practice SQL on rapidly changing schemas.
Now here is a little project I thought up:
Despite the many excellent SQL-based projects I have created, I tend to get rusty in SQL because I don't use it on a regular basis. I decided it would be worthwhile to set up a virtual space to practice; hence the dojo.
The dojo lets a student practice analytical SQL, primarily the kinds of queries analysts use.
Ultimately I'd like to use it in an agile manner as an LMS with a minimal UI. This would require writing a story for each query and a test that checks the query returns a correct answer (see the sketch below). To make things interesting, the tasks should be related, proceed from easy to more challenging, and cover a number of techniques like filtering, aggregation, and subqueries.
Initially, however, I want to get things up and running quickly and to collect questions and answers that show how to create views on a small number of databases from courses or books. This system can also be used to see how well things work on different DBMSs, with the goal of doing things in a portable fashion.
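To make the story-plus-test idea concrete, here is a self-contained sketch using Python's built-in sqlite3 so it runs anywhere; in the dojo the same test would point at the MySQL server, and the story, table, and data here are hypothetical:

```python
import sqlite3

# Story: "As an analyst, I want the three best-selling products."
QUERY = """
    SELECT product_id, SUM(quantity) AS total_sold
    FROM order_items
    GROUP BY product_id
    ORDER BY total_sold DESC
    LIMIT 3;
"""

def test_top_products():
    # Tiny in-memory fixture standing in for the dojo database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE order_items (product_id INT, quantity INT)")
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?)",
        [(1, 5), (2, 9), (3, 2), (1, 3), (4, 1)],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    # Product totals: 2 -> 9, 1 -> 8, 3 -> 2, 4 -> 1.
    assert rows == [(2, 9), (1, 8), (3, 2)]
```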
I thought I might share some specifics. The POC features should be:
Run the server in Docker - easy to install/restart/migrate. (done)
Agile access - e.g. using Visual Studio Code + plugin. (done)
Rich clients - MySQL Workbench. (done)
SQuirreL SQL - supports more RDBMS systems. (done)
Access from Jupyter. (done - but less agile; see the sketch after this list)
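For reference, the Jupyter access amounts to something like this; a minimal sketch assuming a local MySQL server in Docker, with placeholder credentials and database name:

```python
import mysql.connector

# Connect to the dojo's MySQL server running in Docker.
# Host, user, password, and database are placeholders.
conn = mysql.connector.connect(
    host="127.0.0.1",
    port=3306,
    user="student",
    password="dojo",
    database="classicmodels",
)

cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM customers;")
print(cur.fetchone())
conn.close()
```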
Beyond the POC
Migrate db to AWS (more & bigger databases).
create a web interface to
switch RDBMS
log in
enter and run queries
show output log
show query output.
store queries history
keep score
indicate progress in units.
feedback and discussion.
allow users to add stories and queries.
support non-SQL DBs as well, like MongoDB or Neo4j.
Develop small learning units to practice techniques.
[OK] basics
[OK] filtering
[OK] aggregation
[OK] subqueries
[] cleaning data & SQL wrangling
[] OLAP
[] design and DDL
[OK] CRUD + stored procedures from Python. (see the sketch after this list)
[] CRUD + stored procedures from R.
[] CRUD + stored procedures from Java.
[] transactions
[] create queries for a BI dashboard.
[] create queries for a marketing automation project.
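As an illustration of the CRUD-plus-stored-procedures unit, here is a minimal sketch using mysql.connector's callproc; the connection details and the add_customer procedure are hypothetical:

```python
import mysql.connector

# Placeholder connection details for the dojo's MySQL server.
conn = mysql.connector.connect(
    host="127.0.0.1", user="student", password="dojo", database="dojo"
)
cur = conn.cursor()

# Call a (hypothetical) stored procedure that inserts a customer
# and returns the new id via an OUT parameter.
result_args = cur.callproc("add_customer", ("Ada", "Lovelace", 0))
conn.commit()
print("new customer id:", result_args[2])

cur.close()
conn.close()
```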
Migrate queries to database
Show schema for the database.
Make things secure:
isolated users.
reset the DB.
Use serverless backends too:
AWS Athena.
Google BigQuery.
Use NoSQL DBs - MongoDB, Neo4j, ...
Connect to a dedicated environment like MySQL Workbench.
Connect to a BI environment such as Tableau or Power BI.
Use a freemium hosted database like BigQuery.
First snag:
Accessing MySQL 8.0+ requires a new authentication protocol (the default plugin changed to caching_sha2_password). To allow plain user + password connections from older clients I had to re-enable the old mysql_native_password plugin using a somewhat obscure ALTER USER command, or alternatively switch to mysql.connector.connect, which supports the new protocol.
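To record the snag: the usual server-side fix is the ALTER USER statement below, and the client-side alternative is mysql-connector-python, which speaks caching_sha2_password natively. A sketch with placeholder credentials:

```python
import mysql.connector

# Client-side alternative: mysql-connector-python understands
# MySQL 8's default caching_sha2_password plugin out of the box.
conn = mysql.connector.connect(
    host="127.0.0.1",
    user="student",   # placeholder credentials
    password="dojo",
)

# Server-side alternative: switch the account back to the old
# plugin so legacy clients can authenticate (run as an admin):
#   ALTER USER 'student'@'%' IDENTIFIED WITH mysql_native_password BY 'dojo';
#   FLUSH PRIVILEGES;

print(conn.is_connected())
conn.close()
```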
TODO: find this snag and record it.
TODO: add this hack to the MySQL Docker image.
TODO: automate the Docker image to run a script that creates and loads data from a folder.
TODO: add a Docker image for Postgres with equivalent capabilities.
TODO: put the Docker images on AWS.
TODO: get a Docker image with the MySQL sample database, as it is used in many tutorials.
TODO: migrate the project to Trello.
Updates:
I installed SQuirreL SQL to access multiple DBs via a rich client.
I installed GraalVM to do polyglot data science in a notebook.
I created a Jupyter notebook to access the MySQL database.
This is good for accessing a local database.
I plan to extend this to practice polyglot data wrangling, i.e. get data from the DB into R and Python data frames and do some quick explorations (see the sketch below).
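On the Python side, the data-frame step might look like this; a minimal sketch assuming SQLAlchemy with the mysqlconnector dialect, and placeholder credentials, database, and table:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the local dojo MySQL server.
engine = create_engine(
    "mysql+mysqlconnector://student:dojo@127.0.0.1:3306/classicmodels"
)

# Pull a query result straight into a DataFrame for quick exploration.
df = pd.read_sql("SELECT * FROM customers LIMIT 100;", engine)
print(df.describe(include="all"))
```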