downloading folders from google drive.

I wanted to download some course material on RL that the author shared via Google Drive, using the command line. I got a bunch of the material using wget, but a folder on Google Drive was a challenge. I looked it up on Stack Overflow, which gave me a hint but no solution. I installed gdown using pip and then used:

    gdown --folder --continue https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV

If the folder contains more than 50 files you need to add --remaining-ok, and gdown will only fetch the first 50. In that case it is best to download the folder as an archive via the web UI and decompress it locally. Decompressing from the command line produced Unicode-related errors, but the macOS UI decompressed it without a glitch.
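
gdown also has a Python API; the following is a minimal sketch assuming gdown 4.x or later, where download_folder and its remaining_ok flag are available, and where the output directory name rl-course is just a placeholder:

    # Download a shared Google Drive folder via gdown's Python API.
    # Assumes gdown >= 4.x (pip install gdown); remaining_ok mirrors the
    # --remaining-ok CLI flag: folders with more than 50 files are fetched
    # partially (first 50) instead of raising an error.
    import gdown

    url = "https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV"
    gdown.download_folder(url=url, output="rl-course", quiet=False, remaining_ok=True)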

Random Thoughts on Linear Regressions

Regression Analysis

TLDR

Regression is the oldest and most powerful tool in a data scientist's toolbox. Under ideal conditions multiple linear regression would be the best and only tool a data scientist would want to use... In reality you would use a modern nonparametric variant or a different algorithm. Still, the main ideas I discuss here will pop up in many other models and algorithms. This post is my brain dump on regression - I'll update it as time allows to cover the many aspects of this technique.


Advantages

  1. Regression models are easy to run and interpret.
    • Can be used within a workflow to identify the independent variables with the strongest positive and negative influence on the dependent variable.
    • Easy to understand how well the model fits the data.
    • The dreaded p-values indicate the significance level of each independent variable (see the sketch after this list).
  2. Provide not just a prediction but also an assessment of its errors.
  3. Benefit from visual exploration of the data and the regression model.
    • Diagnostic plots can show whether the regression assumptions hold.
    • Small multiples can expose dependence between variables.
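
To make the p-value and error-assessment points concrete, here is a minimal sketch using statsmodels' OLS on a small synthetic data set (the column names x1, x2, y and the data itself are made up for illustration; later sketches in this post reuse this df and model):

    # Fit an ordinary least squares model and read off coefficients,
    # p-values and confidence intervals. The data is synthetic.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["y"] = 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=200)

    X = sm.add_constant(df[["x1", "x2"]])   # add the intercept column
    model = sm.OLS(df["y"], X).fit()

    print(model.summary())    # coefficients, p-values, R-squared, F-statistic
    print(model.conf_int())   # confidence intervals: an assessment of errors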

RealPolitik

Parametric regression easily ends up with too few or too many parameters for a good fit on a general function. When regression was invented by Carl Friedrich Gauss and
Adrien-Marie Legendre it was called the method of least squares, and their data sets were tiny sets of astronomical observations, i.e. a small number of rows and a small, fixed number of independent variables. Fast forward to twenty years ago: we had millions of rows and usually a much smaller number of columns. Today we can also have millions of columns. The old parametric models provide a poor fit to this data.

In many cases in data science, as the number of independent variables grows dramatically, all the subtleties in the analysis vanish and the immediate choice becomes a logistic regression.

Assumptions


First, let us recall the underlying assumptions of linear regression:

1. Linear relationship.

Can be visually inspected in the triangle above the diagonal of the SPLOM chart, that is, a scatter-plot matrix of each independent variable against the dependent variable, augmented with a fit line (sketched below).
Might be fixable using some transform of the variable or by dropping it.
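
A minimal sketch of this check, reusing the synthetic df from the OLS sketch above (seaborn's pairplot with kind="reg" draws the scatter plots with fitted lines):

    # Scatter plots of the dependent variable against each independent
    # variable, augmented with a fitted line. Reuses df from the OLS sketch.
    import matplotlib.pyplot as plt
    import seaborn as sns

    x_cols = [c for c in df.columns if c != "y"]
    sns.pairplot(df, x_vars=x_cols, y_vars=["y"], kind="reg", height=3)
    plt.show()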

2. Multivariate normality

Can be visually inspected via Q-Q plots on the diagonal of the SPLOM, or verified numerically using a goodness-of-fit test, e.g. the Kolmogorov-Smirnov test.
Again, a transform may help, or we may drop the variable.
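
A sketch of both checks on one column of the synthetic df from the OLS sketch above; note that plugging the sample's own mean and standard deviation into the Kolmogorov-Smirnov test is an approximation (strictly speaking that calls for the Lilliefors variant):

    # Normality check: Kolmogorov-Smirnov test plus a Q-Q plot.
    import matplotlib.pyplot as plt
    import scipy.stats as stats

    x = df["x1"]
    stat, p_value = stats.kstest(x, "norm", args=(x.mean(), x.std()))
    print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")

    stats.probplot(x, dist="norm", plot=plt)  # Q-Q plot against the normal
    plt.show()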

3. No or little multicollinearity

Can be visually inspected using half a correlation matrix below the diagonal of the SPLOM.
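
A quick numeric stand-in for that lower triangle, again on the synthetic df from the OLS sketch; pairs with correlations close to ±1 are candidates to drop or combine:

    # Lower triangle of the correlation matrix of the independent variables.
    import numpy as np

    corr = df[["x1", "x2"]].corr()
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)  # strictly below diagonal
    print(corr.where(mask))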

4. No autocorrelation

Time series present a problem, as do most types of structured data in which observations might not be independent. Autocorrelation may or may not be obvious to spot on visual inspection.
Sub-sampling and then using ensembles on different samples might help. Converting the variable to a moving average plus a residual might also help, but it might just make the autocorrelation harder to see.
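
One way to look at this is through the residuals of the OLS sketch above; the Durbin-Watson statistic is not mentioned in the text, it is just one common numeric check (values near 2 suggest little first-order autocorrelation):

    # Visual and numeric checks for autocorrelation in the residuals.
    import matplotlib.pyplot as plt
    from pandas.plotting import autocorrelation_plot
    from statsmodels.stats.stattools import durbin_watson

    autocorrelation_plot(model.resid)   # visual check
    plt.show()

    print(f"Durbin-Watson: {durbin_watson(model.resid):.2f}")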

5. Homoscedasticity

The assumption is violated (heteroscedasticity) when the variance of a variable grows as the variable grows; this can be inspected visually by looking at the scatter plots in the SPLOM. The Goldfeld-Quandt test can be used as a numeric check.

Could be fixed by applying a nonlinear transform, such as a log, or by dropping the offending variable.
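
A sketch of the Goldfeld-Quandt test named above, run on the model from the OLS sketch via statsmodels' het_goldfeldquandt:

    # Goldfeld-Quandt test for heteroscedasticity on the fitted model.
    from statsmodels.stats.diagnostic import het_goldfeldquandt

    f_stat, p_value, _ = het_goldfeldquandt(model.model.endog, model.model.exog)
    print(f"Goldfeld-Quandt F = {f_stat:.2f}, p-value = {p_value:.3f}")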

6. Sample size

While a regression model can be fit to a small amount of data, it should be fitted to sufficient data to produce good results. A practitioner would prefer to end up with a model in which each independent variable has statistical significance, so if the above requirements are followed one should expect the sample size (number of rows) to be some multiple of the number of independent variables (columns).

The rule of thumb of "50 responses minimum and at least 10 responses per independent variable" may fail if your data set is unbalanced. Also, if you are predicting in a region with no sample points, the model may not be valid even if you have tons of data.
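
The rule of thumb translates into a trivial sanity check; the 50 and 10-per-variable figures are taken straight from the text, and the counts come from the synthetic df of the OLS sketch:

    # Quick sanity check of the sample-size rule of thumb.
    n_features = df.shape[1] - 1              # all columns except the response y
    required = max(50, 10 * n_features)
    print(f"rows: {len(df)}, minimum suggested by the rule of thumb: {required}")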

Realities

So we might think we are OK after doing these additional visualizations and inspections, transforming variables, and rerunning the regression after dropping correlated and statistically insignificant variables. But in the real world, despite assurances from the central limit theorem and friends, normality is rarely encountered. So one should expect that most independent variables will not pass item 2 on the list above.

It is also hard to fit small samples across many different groups. One rule of thumb says 20 samples per random variable.


Under The Hood

I have talked with experienced data scientists who were surprised to hear that there exists a closed-form formula for parametric linear regression. However, this is no longer true when we use modern computer-age algorithms for our regression; cf. Computer Age Statistical Inference by Bradley Efron and Trevor Hastie.
Inference is based on the normal equations, whose solution is $$\hat{\beta} = (A^T A)^{-1} A^T y$$ which in practice is handled by numerical algorithms.
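
A sketch of the closed form next to a numerical least-squares solver, reusing the synthetic df from the OLS sketch; in practice a QR- or SVD-based solver is preferred over inverting $A^T A$ directly:

    # Closed-form normal equations vs. numpy's least-squares solver.
    import numpy as np

    A = np.column_stack([np.ones(len(df)), df[["x1", "x2"]].to_numpy()])
    y = df["y"].to_numpy()

    beta_closed = np.linalg.inv(A.T @ A) @ A.T @ y      # (A^T A)^{-1} A^T y
    beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)  # QR/SVD-based solver

    print(beta_closed)
    print(beta_lstsq)   # should agree up to numerical precision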

Common Techniques

Dummy Variables for Categorical Variables

Categorical variables are non-numeric features such as color, city, etc. We cannot use these directly in a regression analysis, so we need to represent them numerically. The natural choice of a numerical encoding introduces a bias and makes it much more difficult to interpret the results. For this reason categorical variables should be encoded using dummy variables. In the case of color, each color value gets a new column such as is_red. We now have more columns and may need to reassess the significance levels of the data. If most samples have is_red set to true we may have problems of significance with some of the other dummy variables.
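
A minimal sketch with pandas' get_dummies on a made-up color column:

    # Dummy (one-hot) encoding of a categorical column with pandas.
    import pandas as pd

    colors = pd.DataFrame({"color": ["red", "red", "blue", "green", "red"]})

    # drop_first=True drops one level to avoid perfect multicollinearity
    # (the "dummy variable trap") with the regression intercept.
    dummies = pd.get_dummies(colors["color"], prefix="is", drop_first=True)
    print(dummies.head())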

Transforms

We could transform variables to correct for problems in the data.
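
For instance, a log transform can tame a positive, right-skewed variable; a minimal sketch on a made-up column added to the synthetic df from the OLS sketch:

    # Log transform of a skewed, positive variable (synthetic illustration).
    import numpy as np

    rng = np.random.default_rng(1)
    df["income"] = rng.lognormal(mean=10, sigma=1, size=len(df))
    df["log_income"] = np.log1p(df["income"])   # compresses the long right tail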

Interactions

We can also include interactions between independent variables $a$ and $b$ by adding new variables such as $a \times b$, $a^2 \times b$, $a \times b^2$, etc. In machine learning this is a form of feature engineering.
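
A sketch of such interaction terms, built by hand on the synthetic df from the OLS sketch; scikit-learn's PolynomialFeatures with interaction_only=True can generate such terms systematically, if that library is available:

    # Interaction terms between x1 and x2, added as new columns.
    df["x1_x2"] = df["x1"] * df["x2"]           # a * b
    df["x1sq_x2"] = df["x1"] ** 2 * df["x2"]    # a^2 * b
    df["x1_x2sq"] = df["x1"] * df["x2"] ** 2    # a * b^2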


References

  • Bradley Efron and Trevor Hastie, Computer Age Statistical Inference, Cambridge University Press, 2016.