Sunday, December 3, 2017

Building an OCR using NLI Ephemera and manuscripts.

The OCR task can be broken down as follows.
  1. Acquire the image.
  2. Segment it into regions according to the following labels:
    1. Image
    2. Text areas, with optional rotation
    3. Tabular data, with optional rotation
  3. Scale down very large text to glyphs of a suitable size.
  4. Improve results by adding terms that better model the noise present on the page.
  5. Improve results by incorporating lexical and grammatical knowledge into the classifier.
Ideally, all of this should be done by an end-to-end system.

A common complex layout of Hebrew sacred texts with non-rectangular columns with related but independent

Once text areas are detected, a page segmentation algorithm is required to break them down into lines and glyphs.
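
As a minimal sketch of that last step, the classic horizontal projection profile can split a text area into lines. It assumes the area is already binarized, deskewed and roughly rectangular, so it would not cope with the non-rectangular columns mentioned above. The function name `segment_lines` and the `min_ink` parameter are made up for illustration.

```python
def segment_lines(bitmap, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    bitmap: list of rows, each a list of 0 (background) / 1 (ink) pixels.
    min_ink: hypothetical threshold, rows with fewer ink pixels count as gaps.
    """
    profile = [sum(row) for row in bitmap]  # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # a line ends at the first gap row
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

Applying the same function to a transposed line (for example `[list(col) for col in zip(*line_rows)]`) gives a vertical profile, which is one simple way to cut a line into glyph candidates when the glyphs do not touch.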

Looking at some samples from the NLI ephemera database, one would want to add steps that clean up the page and rescale elements whose fonts are too small. Given a suitable model, one could perhaps also add detail to text that is too small.

The challenges are:
  1. Training on a large data set of glyphs (the characters used in the fonts one needs to recognize). DNNs, and OCR engines in general, seem to be inflexible when recognizing data that is merely similar to their training data, as they cannot generalize beyond what they are trained on.
  2. Flexible segmentation, and working out the most correct sequence of text blocks so that the page reads logically. (Hebrew is RTL, English is LTR, and the two can also be mixed.) Images paired with hand-converted results from Project Gutenberg, Ben Yehuda and other projects could be useful, as would processing manuscripts, which would introduce greater recognition ability. This needs to be formulated into a loss function.
  3. Improving results by combining lexical and grammatical data into the loss function to select the best subsequences.
  4. Learning to model real noise from different types of documents. Noise can be modeled (as digitization artifacts, aging, wear and tear, gutters, dirt) and separated from the signal adaptively.
Ephemera also have non-standard layouts, unusual variation in letter size, and unexpected content.
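
To make challenge 3 a little more concrete, here is a rough sketch of how lexical knowledge could break ties between visually similar readings. Note that it rescores finished candidates after recognition instead of adding a term to the training loss, which is a simpler stand-in for what is described above. The function `rescore`, the `weight` parameter, the add-one style smoothing and the toy lexicon are all assumptions for illustration.

```python
import math


def rescore(candidates, lexicon_counts, weight=1.0, smoothing=1.0):
    """Pick the candidate word with the best combined visual + lexical score.

    candidates: list of (word, visual_log_prob) pairs, as a hypothetical
        glyph classifier might emit them.
    lexicon_counts: dict mapping known words to their corpus counts.
    weight: how strongly the lexicon is trusted relative to the image.
    """
    total = sum(lexicon_counts.values())
    vocab = len(lexicon_counts) + 1  # one extra bucket for unseen words

    def lexical_log_prob(word):
        count = lexicon_counts.get(word, 0)
        return math.log((count + smoothing) / (total + smoothing * vocab))

    best = max(candidates, key=lambda c: c[1] + weight * lexical_log_prob(c[0]))
    return best[0]


# Final mem and samekh are easy to confuse visually.
candidates = [("שלוס", -1.0), ("שלום", -1.2)]
lexicon = {"שלום": 50, "ספר": 30}
chosen = rescore(candidates, lexicon)
```

With the lexicon switched off (`weight=0`) the visually stronger but nonexistent word wins; with it switched on, the known word is chosen even though its visual score is slightly lower.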

Ideally one would capture vector versions of the glyphs in the most common fonts, in all their weights and variants (including ligatures etc.), and use these together with their frequencies to model the corpus.
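
A very small sketch of the frequency part of this idea: count how often each glyph occurs in a reference text, then draw glyphs with those frequencies when generating synthetic training samples. It works on single characters only, so ligatures, weights and font variants are left out, and the names `glyph_frequencies` and `sample_glyphs` are invented for this example.

```python
import random
from collections import Counter


def glyph_frequencies(corpus_text):
    """Relative frequency of each non-whitespace character in the corpus."""
    counts = Counter(ch for ch in corpus_text if not ch.isspace())
    total = sum(counts.values())
    return {glyph: n / total for glyph, n in counts.items()}


def sample_glyphs(freqs, k, seed=None):
    """Draw k glyphs at random, weighted by their corpus frequency."""
    rng = random.Random(seed)  # seeded so that runs can be repeated
    glyphs = list(freqs)
    weights = [freqs[g] for g in glyphs]
    return rng.choices(glyphs, weights=weights, k=k)


freqs = glyph_frequencies("אבא בא")
sample = sample_glyphs(freqs, k=10, seed=0)
```

In a fuller version the keys would be (font, weight, glyph) entries pointing at vector outlines, but the sampling step would look much the same.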

But how to train it as a GAN (Generative Adversarial Network)?

A piece of software to generate glyphs. (N-category classifier)
A piece of software to generate suitable textual sequences using these glyphs. (RNN or LSTM)
A piece of software to generate text in different layouts and composite pages. (GAN)
A function that checks how close the above is to real scanned pages in a database. (This requires a similarity threshold.)
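
The last item can be illustrated with a deliberately crude sketch. In an actual GAN this role is played by a learned discriminator; the code below substitutes a fixed, hand-written statistic (ink density over a coarse grid) purely to show what a similarity threshold could look like. The names `grid_ink_density` and `looks_real`, the grid size and the threshold value are all assumptions.

```python
def grid_ink_density(bitmap, rows=2, cols=2):
    """Fraction of ink pixels in each cell of a rows x cols grid."""
    h, w = len(bitmap), len(bitmap[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [bitmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cell) / len(cell) if cell else 0.0)
    return feats


def looks_real(candidate, real_pages, threshold=0.1, rows=2, cols=2):
    """True if the candidate is within threshold of its nearest real page."""
    cand = grid_ink_density(candidate, rows, cols)

    def distance(page):
        ref = grid_ink_density(page, rows, cols)
        # mean absolute difference between the two feature vectors
        return sum(abs(a - b) for a, b in zip(cand, ref)) / len(cand)

    return min(distance(page) for page in real_pages) <= threshold


real_pages = [[
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]]
blank_page = [[0] * 4 for _ in range(4)]
```

Here a generated page identical to the stored one passes, while `blank_page` is rejected because its mean density difference (0.25) exceeds the threshold.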

Train the different elements together.