Much of the world’s information is held on paper or in PDFs, many of which are simply scans of physical documents. Document Analysis and Recognition (DAR) is the term for the effort to use computers to crack open these static documents and make them more usable and useful.
Once documents are unlocked and machine readable, a lot can be done with them using what’s called text mining or text analytics, including:
Multiple companies offer text analytics as machine-learning-as-a-service microservices including:
For documents that are not machine readable, such as those scanned to PDF, optical character recognition (OCR) is the key means of text recognition: it converts the characters in a digital image into digital text. Although commercial OCR dates back to the 1950s and results can be very impressive, obtaining consistently high accuracy rates continues to be a challenging problem.
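To make "accuracy rate" a bit more concrete, here is a minimal sketch of one widely used way to score OCR output: the character error rate (CER), which is the edit distance between the recognized text and a trusted transcription, divided by the length of that transcription. The function names and the sample strings are my own illustrations, not part of any particular OCR product.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn one string into the other."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1   # extra character in the hypothesis
            deletion = previous[j] + 1       # reference character was dropped
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings only: a classic OCR confusion is reading "rn" as "m".
print(character_error_rate("modern", "modem"))
```

A CER of 0.0 means a perfect match. Because the score is normalized by the reference length, it can exceed 1.0 when the engine hallucinates a lot of extra text.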
The best commercial OCR capabilities are available as machine-learning-as-a-service microservices including:
Recognizing handwritten text is an even more formidable task than OCR, and the state of the art still falls well short of it. Handwriting Text Recognition (HTR) systems must handle overlapping characters, a mixture of cursive and non-cursive writing, and huge variations in writing styles. The task can be nearly impossible in some cases. Many of us have even had the strange experience of struggling to read our own handwriting.
Until recently, HTR accuracy improved at a slow pace. Most gains were minimal and resulted from small tweaks to existing modeling techniques, such as Hidden Markov Models (HMMs). The core algorithms remained fundamentally unchanged and recognition rates were low for even the best HTR systems.
Recent advances in machine learning, however, have revolutionized the field. In particular, the use of Convolutional Neural Networks (CNNs or ConvNets) and Long Short-Term Memory (LSTM) networks has produced the most significant accuracy improvements in decades. These hybrid deep networks are more robust, handle a larger range of handwriting inputs, and constitute a fundamentally new approach to HTR.
LSTM networks are a type of Recurrent Neural Network (RNN) that can learn tasks requiring memories of events that happened thousands or even millions of discrete time steps earlier. This makes them ideal for HTR where letter and word orders are highly correlated.
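To give a feel for how an LSTM holds on to information, here is a toy sketch of a single LSTM unit using the standard gate equations. Everything about the setup is an assumption made for illustration: the cell has just one unit, the weights in DEMO_PARAMS are hand-picked rather than trained, and none of this reflects how any real HTR engine is configured.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new cell (memory) state c.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    i = sigmoid(pre_activation("input"))        # how much new information to write
    o = sigmoid(pre_activation("output"))       # how much memory to expose
    g = math.tanh(pre_activation("candidate"))  # the new information itself
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_sequence(inputs, params):
    """Feed a sequence through the cell, starting from an empty state."""
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return h, c


# Hand-picked (not learned) weights: the forget gate stays almost fully open,
# and the input gate only opens when the input is large.
DEMO_PARAMS = {
    "forget": (0.0, 0.0, 10.0),
    "input": (20.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (5.0, 0.0, 0.0),
}

# One "event" at the first step, followed by 50 steps of silence.
h, c = run_sequence([1.0] + [0.0] * 50, DEMO_PARAMS)
print(f"memory after 50 quiet steps: {c:.4f}")
```

Because the forget gate multiplies the old memory by a value very close to 1, the cell state barely decays over the quiet steps. That gated, additive memory update is what lets trained LSTMs carry context across long sequences.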
Tesseract 4.0, an open-source, multilingual OCR/HTR engine maintained by Google, was re-architected in the summer of 2017 to use a hybrid CNN/LSTM deep neural network. The model was trained for several weeks on a corpus of 400,000 text lines spanning approximately 4,500 fonts. The reported accuracy gains are tremendous and the engine now supports over 100 languages.
Despite the impressive gains achieved with deep learning techniques, HTR continues to trail OCR in performance and accuracy. There are several key best practices one can follow, however, to help improve recognition results. These include:
For the times when computers can’t accurately assess either typed or handwritten text, have low confidence in their findings, or run into exceptions, the fallback is to create a human-in-the-loop workflow to properly identify what was written. In other words, a person is asked to read what something says and type the answer. With this approach, an overall workflow can be very accurate, even if the OCR and HTR can’t handle certain situations. Top vendors of these human-in-the-loop workflow services include Alegion and Figure Eight.
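The routing step at the heart of such a workflow can be sketched in a few lines. This is only an illustration under stated assumptions: the Recognition record, the 0.90 threshold, and the sample field values are all hypothetical, and a real engine or vendor service will report confidence in its own format.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    field_id: str      # which part of the document this text came from
    text: str          # what the engine thinks was written
    confidence: float  # assumed to be reported on a 0.0 to 1.0 scale


def route(results, threshold=0.90):
    """Split results into auto-accepted ones and ones queued for a person.

    Empty output is always sent to review, whatever the confidence,
    since it usually signals an exception rather than a blank field.
    """
    accepted, review_queue = [], []
    for result in results:
        if result.text.strip() and result.confidence >= threshold:
            accepted.append(result)
        else:
            review_queue.append(result)
    return accepted, review_queue


# Hypothetical results for three fields on a scanned form.
results = [
    Recognition("name", "Jane Doe", 0.98),
    Recognition("date", "O3/l4/2O18", 0.61),
    Recognition("signature", "", 0.95),
]
accepted, review_queue = route(results)
print([r.field_id for r in accepted])
print([r.field_id for r in review_queue])
```

The threshold is the main dial: raising it sends more items to people and improves overall accuracy at a higher labor cost, while lowering it does the opposite.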
Finally, for those interested in digging deeper into these areas, there are several important technical conferences on Document Analysis and Recognition held annually:
New deep learning techniques have revolutionized the field of document and text analysis and are contributing to dramatic improvements in the state of the art. Unlocking insights from unstructured data captured in static documents has broad applications, with new use cases popping up all the time. Unfathomable amounts of data and insights are currently hidden in billions of physical and PDF documents. Imagine the intelligence and informed actions your business could unlock with these new technologies.