OCRA
-
OCRA consists of a document corpus, built by searching Google for
PDF files and excluding those that don't contain embedded
images.
We then run each recognizer (both free and commercial) on
this corpus and compare the results.
We try to align the sentences of the recognizer outputs and then
run edit distance on individual words to build a probability
distribution: the probability that a given word is translated
into another word.
We combine
this scorer with a forest ranker operating on a statistical
language model of the entire text corpus.
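
As a rough illustration of the alignment and edit-distance step, here
is a minimal Python sketch. It is not OCRA's actual code: the function
names (edit_distance, align_words, confusion_probabilities) are
hypothetical, and it assumes that one recognizer's output is treated
as the reference, that sentences are already paired up, and that
words are aligned with a simple dynamic program whose substitution
cost is the normalized character edit distance.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def word_cost(x, y):
    """Substitution cost in [0, 1]; assumes non-empty tokens."""
    return edit_distance(x, y) / max(len(x), len(y))


def align_words(ref, hyp):
    """Align two token lists and return the (ref, hyp) word pairs."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = float(i)
    for j in range(1, m + 1):
        cost[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1,
                             cost[i - 1][j - 1] + word_cost(ref[i - 1], hyp[j - 1]))
    # Trace back through the table, keeping only word-to-word matches.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + word_cost(ref[i - 1], hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1  # reference word has no counterpart
        else:
            j -= 1  # hypothesis word has no counterpart
    pairs.reverse()
    return pairs


def confusion_probabilities(sentence_pairs):
    """Estimate P(output word | reference word) from aligned sentences."""
    counts = defaultdict(Counter)
    for ref_sentence, hyp_sentence in sentence_pairs:
        for r, h in align_words(ref_sentence.split(), hyp_sentence.split()):
            counts[r][h] += 1
    return {r: {h: c / sum(hs.values()) for h, c in hs.items()}
            for r, hs in counts.items()}
```

Calling confusion_probabilities on a list of (reference sentence,
recognizer sentence) pairs returns, for each reference word, the
relative frequency of every word it was aligned with. A real system
would likely need smoothing and a way to handle merged or split
words, which this sketch ignores.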