In a previous tutorial I covered the basics of digitizing old stats with ABBYY FineReader (& alternative digitization tools). Now, I dig into some important digitization nitty gritty: training optical character recognition software to properly read historical content. Most historical digitization projects will entail training. Old statistical documents often use long-gone proprietary typefaces. While modern OCR software can easily read Arials and Times New Romans, it needs help with more exotic typography; this is where training comes in. If you have ever worked with machine learning-type projects and/or text analysis, training software to properly classify stuff is a familiar concept. And luckily, training ABBYY FineReader's engine is pretty easy.
Say you're working with an old scanned document and you wish to extract its tables. If you OCR the document using the default pattern recognition settings, you will likely be disappointed. Old documents will have many peculiarities that will be missed by FineReader unless we point the software in the right direction. Using custom User Patterns, we train ABBYY FineReader's OCRing algorithm to correctly classify characters in our historic data. Training is used to tell FineReader, "Hey, don't do that. That British pound sign is really a number." Anyone who has ever had to get their hands dirty with machine learning will immediately recognize the intuition. Below is an example of a historic document OCRed using FineReader's default recognition schemes: many numbers have been replaced by letters or strange characters.