Brief: Wrapper and post-processor for free OCR systems

  • OCRA is built around a document corpus, assembled by searching Google for PDF files and excluding those that contain no embedded images. We run each recognizer (both free and commercial) on this corpus and compare the results. We then try to align the output sentences and run edit distance on the individual aligned words to build a probability distribution: the probability that a given word is recognized as another word. We combine this scorer with a forest ranker operating on a statistical language model of the entire text corpus.
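
The word-level scoring step described above can be sketched roughly as follows. This is a minimal illustration, not OCRA's actual code: the function names (edit_distance, confusion_counts, confusion_probabilities), the use of Python's difflib to align words within a sentence pair, and the max_rel_dist threshold are all assumptions made for the example. It takes pairs of already-aligned sentences (a reference reading and another recognizer's reading), pairs up the words, keeps substitutions whose edit distance is small relative to word length, and normalizes the counts into conditional probabilities.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character of a
                curr[j - 1] + 1,           # insert a character of b
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def confusion_counts(sentence_pairs, max_rel_dist=0.5):
    """Count how often each reference word is read as each observed word.

    sentence_pairs: iterable of (reference_sentence, observed_sentence).
    max_rel_dist:   hypothetical cutoff; a substitution is counted only if
                    its edit distance is at most this fraction of the
                    longer word's length.
    """
    counts = defaultdict(Counter)
    for ref_sentence, obs_sentence in sentence_pairs:
        ref, obs = ref_sentence.split(), obs_sentence.split()
        # Assumption: difflib stands in for the sentence/word aligner.
        matcher = SequenceMatcher(a=ref, b=obs, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                for word in ref[i1:i2]:
                    counts[word][word] += 1
            elif tag == "replace" and (i2 - i1) == (j2 - j1):
                # Equal-length replaced runs are paired position by position.
                for r, o in zip(ref[i1:i2], obs[j1:j2]):
                    if edit_distance(r, o) <= max_rel_dist * max(len(r), len(o)):
                        counts[r][o] += 1
    return counts


def confusion_probabilities(counts):
    """Normalize counts into P(observed word | reference word)."""
    return {
        ref: {obs: n / sum(c.values()) for obs, n in c.items()}
        for ref, c in counts.items()
    }
```

Insertions, deletions, and unequal-length replacements are simply skipped in this sketch; a real aligner would need to handle split and merged words, which are common in OCR output. The resulting table could then serve as the per-word scorer that is combined with the language-model ranker.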