Downloading folders from Google Drive

I wanted to download some course material on RL that the author had shared via Google Drive, using the command line. I got most of it with wget, but a folder in Google Drive was a challenge. I looked it up on Stack Overflow, which gave me a hint but no full solution. I installed gdown using pip and then ran:

gdown --folder --continue https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV

If the folder contains more than 50 files you need to add --remaining-ok, and you will only get the first 50. In that case it is best to download the folder via the web UI and decompress it locally. Decompressing from the command line produced Unicode-related errors, but using the macOS UI it decompressed without a glitch.
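The same thing can be scripted in Python, since gdown also exposes a download_folder call. A minimal sketch, assuming a recent gdown (4.x or later); the output directory name is arbitrary and the keyword arguments should be checked against the installed version:

```python
# Minimal sketch using gdown's Python API; verify argument names against
# your installed gdown version.
import gdown

folder_url = "https://drive.google.com/drive/folders/1V9jAShWpccLvByv5S1DuOzo6GVvzd4LV"

# Download the whole folder into ./rl-course. remaining_ok=True tells gdown
# not to fail when the folder holds more than the 50-file limit (it will
# still only fetch the first 50 files).
gdown.download_folder(
    url=folder_url,
    output="rl-course",
    quiet=False,
    remaining_ok=True,
)
```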

Building an OCR system using NLI Ephemera and manuscripts


The OCR task can be broken down as follows.
  1. Acquire the image.
  2. Segment it into regions according to the following labels:
    1. Image
    2. Text areas, with optional rotation
    3. Tabular data, with optional rotation
  3. Scale down very large text to glyphs of a suitable size.
  4. Improve results by adding terms that better model the noise present on the page.
  5. Improve results by incorporating lexical and grammatical knowledge into the classifier.
Ideally all of this should be done by an end-to-end system; a rough sketch of how the stages might fit together follows below.
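As an illustration only, the stages could be wired together roughly like this. Every stage here is a hypothetical placeholder passed in as a callable; only the control flow is meant to reflect the list above:

```python
# Sketch of the pipeline stages listed above. All stages are hypothetical
# placeholders supplied by the caller.
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Region:
    kind: str        # "image", "text" or "table"
    rotation: float  # detected rotation in degrees
    pixels: object   # cropped region, e.g. a numpy array

def ocr_page(page,
             segment: Callable[[object], Iterable[Region]],
             rescale: Callable[[Region], Region],
             denoise: Callable[[Region], Region],
             recognize: Callable[[Region], str],
             language_model: Callable[[str], str]) -> str:
    lines: List[str] = []
    for region in segment(page):           # step 2: image / text / table regions
        if region.kind != "text":
            continue
        region = rescale(region)           # step 3: bring oversized glyphs to size
        region = denoise(region)           # step 4: model and remove page noise
        raw = recognize(region)            # per-line / per-glyph classification
        lines.append(language_model(raw))  # step 5: lexical / grammatical cleanup
    return "\n".join(lines)
```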

Figure: a common complex layout of Hebrew sacred texts, with non-rectangular columns holding related but independent text sequences.
Once text areas are detected, a page segmentation algorithm is required to break them down into lines and glyphs.
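One simple baseline for the line-splitting step is a horizontal projection profile over a binarized text block. A minimal sketch, assuming a numpy array with text pixels set to 1; real ephemera will need something far more robust:

```python
import numpy as np

def segment_lines(binary: np.ndarray, min_height: int = 3):
    """Split a binarized text block (text = 1, background = 0) into line
    images using a horizontal projection profile. Illustrative only."""
    profile = binary.sum(axis=1)                # amount of ink per row
    lines, in_line, start = [], False, 0
    for y, ink in enumerate(profile):
        if ink > 0 and not in_line:
            in_line, start = True, y            # a text line begins
        elif ink == 0 and in_line:
            if y - start >= min_height:         # skip very thin runs (noise)
                lines.append(binary[start:y, :])
            in_line = False
    if in_line:                                 # line running to the bottom edge
        lines.append(binary[start:, :])
    return lines
```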

Looking at some samples from the NLI ephemera database, one would want to add steps to clean up and rescale elements whose fonts are too small. With a suitable model one could perhaps even add detail back to text that is too small.
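For the rescaling step, plain bicubic upscaling of regions whose estimated x-height falls below a threshold is a cheap starting point. A sketch using OpenCV, where the 20 px target is an arbitrary illustrative value:

```python
import cv2

def upscale_small_text(region, est_xheight_px: float,
                       target_xheight_px: float = 20.0):
    """Upscale a text region whose glyphs are too small for the recognizer.
    The 20 px target x-height is an arbitrary illustrative choice."""
    if est_xheight_px >= target_xheight_px:
        return region
    scale = target_xheight_px / est_xheight_px
    return cv2.resize(region, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)
```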

The challenges are:
  1. Training on a large data set of glyphs (the characters used in the fonts one needs to recognize). DNNs, and OCR engines in general, seem to be inflexible when recognizing similar data, as they simply cannot generalize beyond what they were trained on.
  2. Flexible segmentation, and figuring out the most correct sequence of text blocks so that the page reads logically (Hebrew is RTL, English is LTR, and they can also be mixed). Using images plus hand-converted results from Project Gutenberg, Project Ben-Yehuda and other projects could be useful, as would processing of manuscripts, which would introduce greater recognition ability. This needs to be formulated into a loss function.
  3. Improving results by combining lexical and grammatical data into the loss function to select the best subsequences (a small rescoring sketch follows this list).
  4. Learning to model real noise from different types of documents (digitization artifacts, aging, wear and tear, gutters, dirt) and separating it from the signal adaptively.
Ephemera also have non-standard layouts and unusual letter-size variation, as well as unexpected content.
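For challenge 3, one common approach is to rescore the recognizer's candidate readings with lexical knowledge. A toy sketch, where the interpolation weight and the idea of a plain word-list lexicon are purely illustrative:

```python
import math

def rescore(candidates, lexicon, alpha: float = 0.7):
    """candidates: list of (text, ocr_log_prob) pairs from the recognizer.
    Combine the OCR score with a crude lexical score (smoothed fraction of
    tokens found in the lexicon) and return the best reading. Toy example."""
    def lexical_score(text: str) -> float:
        tokens = text.split()
        if not tokens:
            return -math.inf
        hits = sum(1 for t in tokens if t in lexicon)
        return math.log((hits + 1) / (len(tokens) + 1))

    return max(candidates,
               key=lambda c: alpha * c[1] + (1 - alpha) * lexical_score(c[0]))
```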

Ideally one would capture vector versions of the glyphs in the most common fonts, in all their weights and variants (including ligatures, etc.), and use these together with character frequencies to model the corpus.
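Rendering such fonts into labelled glyph images for training is straightforward with Pillow. A small sketch, where the font path, sizes and example characters are placeholders for whatever Hebrew fonts are actually available:

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph(char: str, font_path: str, size: int = 48, canvas: int = 64):
    """Render a single glyph onto a white canvas. font_path points at
    whatever font file is actually available (placeholder here)."""
    font = ImageFont.truetype(font_path, size)
    img = Image.new("L", (canvas, canvas), color=255)
    draw = ImageDraw.Draw(img)
    draw.text((canvas // 8, canvas // 8), char, fill=0, font=font)
    return img

# e.g. samples = [render_glyph(c, "fonts/SomeHebrewFont.ttf") for c in "אבגדה"]
```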

But how to train this as a GAN (Generative Adversarial Network)?

A piece of software to generate glyphs. (N-category classifier)
A piece of software to generate suitable textual sequences using these glyphs. (RNN or LSTM)
A piece of software to generate text in different layouts and composite pages. (GAN)
A function that checks how close the above are to real pages scanned in a database. (The requirement is a similarity threshold.)

Train the different elements together.
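In PyTorch, the adversarial part might reduce to a standard GAN training step, with the page generator built from the components above and the discriminator playing the role of the closeness check. A heavily simplified sketch in which the networks, optimizers and real-page batches are all assumed to be supplied by the caller:

```python
import torch
import torch.nn.functional as F

def gan_train_step(page_generator, discriminator, real_pages,
                   g_opt, d_opt, z_dim: int = 128):
    """One adversarial update. page_generator is assumed to composite a
    synthetic page image from a latent code; discriminator outputs one
    logit per page. Purely illustrative."""
    batch = real_pages.size(0)
    z = torch.randn(batch, z_dim)

    # Discriminator: push real pages towards 1, generated pages towards 0.
    fake_pages = page_generator(z).detach()
    d_loss = (F.binary_cross_entropy_with_logits(
                  discriminator(real_pages), torch.ones(batch, 1)) +
              F.binary_cross_entropy_with_logits(
                  discriminator(fake_pages), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: try to make the discriminator score its pages as real.
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(page_generator(z)), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```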
