Building an OCR using NLI Ephemera and manuscripts.

December 03, 2017

The OCR task can be broken down as follows.

Acquire the image.
Segment it into regions according to the following labels:

Images
Text areas, with optional rotation
Tabular data, with optional rotation

Scale very large text down to glyphs of a suitable size.
Improve results by adding terms that better model the noise present on the page.
Improve results by building lexical and grammatical knowledge into the classifier.
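The last step can be illustrated with a small rescoring sketch. It assumes the classifier emits a few candidate readings per word with log-probabilities, and that a word-frequency lexicon is available; `rescore`, `lexicon_freq`, `weight` and `floor` are hypothetical names chosen for illustration, not part of any existing OCR engine.

```python
import math

def rescore(candidates, lexicon_freq, weight=1.0, floor=1e-6):
    """Pick the candidate word with the best combined score.

    candidates: list of (word, classifier_log_prob) pairs.
    lexicon_freq: dict mapping word -> relative frequency (hypothetical lexicon).
    weight: how strongly the lexical prior counts against the classifier score.
    floor: frequency assumed for words missing from the lexicon.
    """
    def score(item):
        word, log_p = item
        prior = math.log(lexicon_freq.get(word, floor))
        return log_p + weight * prior

    return max(candidates, key=score)[0]

# The glyph classifier slightly prefers a non-word; the lexicon overrides it.
candidates = [("tbe", math.log(0.55)), ("the", math.log(0.45))]
lexicon = {"the": 0.05, "tie": 0.001}
best = rescore(candidates, lexicon)
```

Setting `weight=0.0` falls back to the raw classifier choice, which makes it easy to measure how much the lexicon actually helps. Grammatical knowledge would need a sequence model rather than the single-word prior used here.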

Ideally, all of this should be done by an end-to-end system.

A common complex layout of Hebrew sacred texts: non-rectangular columns holding related but independent sequences.

Once text areas are detected, a page segmentation algorithm is required to break them down into lines and glyphs.
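One classical way to do this is a projection profile: count the ink in each row to find lines, then count the ink in each column of a line to find glyphs. The sketch below is a minimal version of that idea, assuming a binarized, deskewed text area stored as a list of rows of 0 and 1. The function names are made up for illustration, and the approach would not cope on its own with the non-rectangular columns described above or with touching glyphs.

```python
def split_lines(page, min_ink=1):
    """Return (top, bottom) row ranges of text lines in a binary page.

    page: list of rows, each a list of 0 (background) or 1 (ink).
    min_ink: minimum ink pixels for a row to count as part of a line.
    The bottom index of each range is exclusive.
    """
    lines, start = [], None
    for y, row in enumerate(page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                    # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y))     # the line ended on the previous row
            start = None
    if start is not None:                # line runs to the bottom edge
        lines.append((start, len(page)))
    return lines

def split_glyphs(page, top, bottom, min_ink=1):
    """Find glyph column ranges inside one line by transposing the band."""
    band = page[top:bottom]
    columns = [list(col) for col in zip(*band)]
    return split_lines(columns, min_ink)

page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
lines = split_lines(page)
glyphs = split_glyphs(page, *lines[0])
```

For Hebrew the glyph ranges would then be read from the highest column index to the lowest.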

Looking at some samples from the NLI ephemera database, one would want to add steps that clean up and rescale elements whose fonts are too small. With a suitable model, one could perhaps also restore detail to text that is too small.

The challenges are:

Training on a large data set of glyphs (the characters used in the fonts one needs to recognize). DNNs and OCR engines in general seem to be inflexible when recognizing data that is merely similar to their training data, as they generalize poorly beyond what they were trained on.
Flexible segmentation, and working out the most correct sequence of text blocks so that the page reads logically. (Hebrew is RTL, English is LTR, and the two can be mixed.) Using images plus hand-converted results from Project Gutenberg, Ben Yehuda and other projects could be useful, as would processing manuscripts, which would broaden recognition ability. This needs to be formulated into a loss function.
Improving results by combining lexical and grammatical data into the loss function to select the best subsequences.
Learning to model real noise from different types of documents. Noise can be modeled as digitization artifacts, aging, wear and tear, gutters and dirt, and should be separated from the signal adaptively.
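As a starting point for the noise challenge, one could degrade clean synthetic pages and train the recognizer on the degraded copies. The sketch below is a deliberately crude, hand-written stand-in for a learned noise model: it only adds speckle and a dark gutter band to a binary page. `add_noise`, `flip_prob` and `gutter_width` are hypothetical names, and real aging or wear would need a far richer model.

```python
import random

def add_noise(page, flip_prob=0.02, gutter_width=0, seed=None):
    """Return a degraded copy of a binary page (0 = background, 1 = ink).

    flip_prob: chance that a pixel is flipped (salt-and-pepper speckle).
    gutter_width: number of left-hand columns darkened to mimic a gutter shadow.
    seed: optional seed so that a degradation can be reproduced.
    """
    rng = random.Random(seed)
    noisy = []
    for row in page:
        new_row = []
        for x, pixel in enumerate(row):
            if x < gutter_width:
                new_row.append(1)            # gutter shadow hides the content
            elif rng.random() < flip_prob:
                new_row.append(1 - pixel)    # speckle: flip ink and background
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy

clean = [[0, 1, 0], [1, 0, 1]]
dirty = add_noise(clean, flip_prob=0.1, gutter_width=1, seed=0)
```

Because the clean page is known, each (dirty, clean) pair can also serve as training data for a model that learns to separate noise from signal.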

Ephemera also have non-standard layouts, unusual variation in letter size, and unexpected content.

Ideally one would capture vector versions of the glyphs in the most common fonts, in all their weights and variants (including ligatures etc.), and use these together with their frequencies to model the corpus.
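The frequency part of this idea is easy to prototype: estimate character frequencies from sample text and draw glyph labels in proportion to them when generating training data. The sketch below assumes plain character counts are good enough and ignores fonts, weights and ligatures entirely; `glyph_frequencies` and `sample_glyphs` are hypothetical helper names.

```python
import random
from collections import Counter

def glyph_frequencies(text):
    """Count how often each non-space character occurs in a sample text."""
    return Counter(ch for ch in text if not ch.isspace())

def sample_glyphs(freqs, n, seed=None):
    """Draw n glyph labels with probability proportional to their counts."""
    rng = random.Random(seed)
    glyphs = list(freqs)
    weights = [freqs[g] for g in glyphs]
    return rng.choices(glyphs, weights=weights, k=n)

freqs = glyph_frequencies("שלום עולם")
batch = sample_glyphs(freqs, 20, seed=1)
```

Each sampled label would then be rendered from the vector glyph in a chosen font and weight before being fed to the classifier.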

But how to train it as a GAN (Generative Adversarial Network)?

A piece of software to generate glyphs. (N-category classifier)
A piece of software to generate suitable textual sequences using these glyphs. (RNN or LSTM)
A piece of software to generate text in different layouts and composite pages. (GAN)
A function that checks how close the output of the above is to real scanned pages in a database. (This requires a similarity threshold.)
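In a real GAN the last component would be a learned discriminator. As a rough placeholder, the sketch below compares a generated page with real pages using a hand-written feature (the fraction of ink in each cell of a coarse grid) and a cosine similarity threshold. The feature, the function names and the default threshold of 0.9 are all assumptions made for illustration.

```python
import math

def ink_density(page, grid=2):
    """Summarise a binary page as the ink fraction in each cell of a grid x grid layout."""
    height, width = len(page), len(page[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [page[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell) if cell else 0.0)
    return features

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def looks_real(generated, real_pages, threshold=0.9, grid=2):
    """True if the generated page is close enough to at least one real page.

    real_pages must be a non-empty list of binary pages.
    """
    feats = ink_density(generated, grid)
    best = max(cosine(feats, ink_density(p, grid)) for p in real_pages)
    return best >= threshold

real = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
fake = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
accepted = looks_real(real, [real])
rejected = looks_real(fake, [real])
```

A fixed measure like this cannot be trained jointly with the generators, so it is only useful as a sanity check before a learned discriminator takes its place.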

Train the different elements together.
