awesome-ocr
Links to awesome OCR projects
Awesome OCR
This list contains links to great software tools, libraries, and literature related to Optical Character Recognition (OCR).
Contributions are welcome, as is feedback.
- Software
  - OCR engines
  - Older and possibly abandoned OCR engines
  - OCR file formats
    - hOCR
    - ALTO XML
    - TEI
    - PAGE XML
  - OCR CLI
  - OCR GUI
  - OCR Preprocessing
  - OCR as a Service
  - OCR evaluation
  - OCR libraries by programming language
    - Crystal
    - Elixir
    - Go
    - Java
    - .Net
    - Object Pascal
    - PHP
    - Python
    - Javascript
    - Ruby
    - Swift
    - Rust
    - R
  - OCR training tools
- Datasets
  - Ground Truth
- Literature
  - OCR-related publication and link lists
  - Blog Posts and Tutorials
  - OCR Showcases
  - Academic articles
    - 2011 and before
    - 2012
    - 2013
    - 2014
    - 2015
    - 2016
    - 2017
    - 2018
Software
OCR engines
- tesseract - The definitive Open Source OCR engine, Apache 2.0
- EasyOCR - OCR engine built on PyTorch by JaidedAI, Apache 2.0
- ocropus - OCR engine based on LSTM, Apache 2.0
- ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
- kraken - Ocropus fork with sane defaults
- gocr - OCR engine under the GNU General Public License, led by Joerg Schulenburg.
- Ocrad - The GNU OCR, GPL
- ocular - Machine-learning OCR for historic documents
- SwiftOCR - fast and simple OCR library written in Swift
- attention-ocr - OCR engine using visual attention mechanisms
- RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
- simple-ocr-opencv and its fork - A simple pythonic OCR engine using opencv and numpy
- Calamari - OCR Engine based on OCRopy and Kraken
- doctr - A seamless & high-performing OCR library powered by Deep Learning
Older and possibly abandoned OCR engines
- Clara OCR - Open source OCR in C, GPL
- Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
- Eye - an experimental Java OCR (image-to-text) application
- kognition - An omnifont OCR software for KDE
- OCRchie - Modular Optical Character Recognition Software
- ocre - o.c.r. easy
- xplab - A GTK 2 tool for pattern matching
- hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article), GPL
OCR file formats
hOCR
- hocr-tools - Tools for doing various useful things with hOCR files, Apache 2.0
- hocr-spec - hOCR 1.2 specification
- ocr-transform - CLI tool to convert between hOCR and ALTO, MIT
- hocr-parser - hOCR Specification Python Parser
- hOCRTools - hOCR to ALTO conversion XSLT
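hOCR embeds layout information in the `title` attribute of HTML elements, for example `bbox x0 y0 x1 y1` on spans with the class `ocrx_word`. The following is a minimal sketch of reading word boxes with only the Python standard library. The helper names `parse_bbox` and `extract_words` are hypothetical and not part of any tool listed above, and the sketch assumes that word spans contain no nested spans.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: the first closing span after a word start closes the word.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


sample = (
    "<span class='ocr_line' title='bbox 10 10 200 40'>"
    "<span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 96'>Hello</span> "
    "<span class='ocrx_word' title='bbox 100 10 200 40; x_wconf 91'>world</span>"
    "</span>"
)
words = extract_words(sample)
```

For real documents, the tools listed above handle nested elements and the full set of hOCR properties.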
ALTO XML
- ALTO XML Schema - XML Schema and development of the ALTO XML format
- ALTO XML Documentation - Documentation and use cases for ALTO
- alto-tools - Various tools to work with ALTO files, Python
- AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
TEI
- TEI-OCR - TEI customization for OCR generated layout and content information
- TEI SIG on Libraries - Best Practices for TEI in Libraries
- GDZ - METS/TEI-based GDZ document format
PAGE XML
- PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
- omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
- py-pagexml - Python library for handling PAGE XML and OPF files.
OCR CLI
- OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
- Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
- Ocrocis - Project manager interface for Ocropy, see also external project homepage
- tesseract-recognize - Tesseract-based tool that outputs result in Page XML format (docker image).
OCR GUI
- moz-hocr-editor - Firefox Addon for editing hOCR files (discontinued)
- qt-box-editor - QT4 editor of tesseract-ocr box files.
- ocr-gt-tools - Client-Server application for editing OCR ground truth.
- Paperwork - Using scanners and OCR to grep paper documents the easy way.
- Paperless - Scan, index, and archive all of your paper documents.
- gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
- VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including jTessBoxEditor, a graphical Tesseract box data editor
- PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
- OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
- PRImA PAGE Viewer - Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and hOCR.
- LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
- archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.
- nw-page-editor - Simple app for visual editing of Page XML files. Provides desktop and server docker-based versions.
OCR Preprocessing
- NoiseRemove.java in MathOCR - Java implementation of "Adaptive degraded document image binarization" by B. Gatos, I. Pratikakis, S.J. Perantonis
- binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
- typeface-corpus - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities.
- binarizewolfjolion - Comparison of binarization algorithms. Blog post
- crop_morphology.py in oldnyc - Cropping a page to just the text block
- Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
- Fred's ImageMagick script textcleaner - Processes a scanned document of text to clean the text background
- localcontrast - Fast O(1) local contrast optimization
OCR as a Service
- Open OCR - Run Tesseract in Docker containers
- tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
- docker-ocropy - A Docker container for running the ocropy OCR system.
- ABBYY Cloud OCR SDK Code samples - Code samples for using the proprietary commercial ABBYY OCR API.
- nidaba - An expandable and scalable OCR pipeline
- gamera - A meta-framework for building document processing applications, e.g. OCR
- ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
- ocrad-docker - Run the ocrad OCR engine in a docker container
- kraken-docker - Run the kraken OCR engine in a docker container
- Konfuzio - Free online OCR for up to 2,000 pages per month and OCR API by @atraining, see https://youtu.be/NZKUrKyFVA8 (code is not open)
- ocr.space - Free Online OCR and OCR API by @a9t9 based on Tesseract (code is not open)
- OCR4all - Provides OCR services through web applications. Included Projects: LAREX, OCRopus, calamari and nashi.
OCR evaluation
- ISRI OCR Evaluation Tools with a User Guide from 1996
  - isri-ocr-evaluation-tools - further development by @eddieantonio (2015, 2016)
  - ancientgreekocr-evaluation-tools - further development by @nickjwhite (2013, 2014)
- ocrevalUAtion - Cross-format evaluation, CLI and GUI
- ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
- quack - Quality-Assurance-tool for scans with corresponding ALTO-files
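A common metric in OCR evaluation is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch in plain Python, not the implementation used by any of the tools above. The function names are hypothetical, and no text normalization (whitespace, Unicode forms) is applied.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


# One substituted character in a seven-character word.
cer = character_error_rate("Fraktur", "Frakfur")
```

Note that the CER can exceed 1.0 when the OCR output is much longer than the ground truth.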
OCR libraries by programming language
Crystal
- tesseract-ocr - A Crystal wrapper for tesseract-ocr.
Elixir
- tesseract_ocr - Elixir library wrapping the tesseract executable.
Go
- gosseract - Golang OCR library, wrapping Tesseract-ocr.
Java
- Tess4J - Java Native Access bindings to Tesseract.
- tess-two - Tools for compiling Tesseract on Android and Java API.
.Net
- tesseract for .net - A .Net wrapper for tesseract-ocr.
Object Pascal
- TTesseractOCR4 - Object Pascal binding for tesseract-ocr 4.x.
PHP
- Tesseract OCR for PHP - Tesseract PHP bindings.
Python
- pytesseract - A Python wrapper for Google Tesseract.
- pyocr - A Python wrapper for Tesseract and Cuneiform.
- ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
- tesserocr - A Python wrapper for the tesseract-ocr API
Javascript
- ocracy - Pure JavaScript LSTM RNN implementation based on ocropus
- gocr.js - Javascript port (emscripten) of gocr
- ocrad.js - Javascript port (emscripten) of ocrad
- tesseract.js - Javascript port (emscripten) of Tesseract
- node-tesseract-ocr - A simple wrapper for the Tesseract OCR package.
- node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.
Ruby
- rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
- ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
- ocr_space - API wrapper for free ocr service ocr.space. Includes CLI
Rust
- tesseract.rs - Rust bindings for tesseract OCR.
- leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.
R
- tesseract - R bindings for tesseract OCR.
Swift
- Tesseract OCR iOS - Swift and Objective-C wrapper for Tesseract OCR.
- SwiftOCR - Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes.
OCR training tools
- glyph-miner - A system for extracting glyphs from early typeset prints
- ocrodeg - Document image degradation for OCR data augmentation
Datasets
Ground Truth
- archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via archiscribe, CC-BY 4.0
- CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for PoCoTo
- Rescribe - Transcriptions of Caroline Minuscule Manuscripts, PDM 1.0
- CLTK - Corpora from Classical Language Toolkit, PDM 1.0
- DIVA-HisDB - 150 pages (PAGE-XML) of three medieval manuscripts, CC-BY-NC 3.0
- EarlyPrintedBooks - ~8,800 lines from several early printed books, CC-BY-NC-SA 4.0
- EEBO-TCP - 25,363 EEBO documents transcribed by TCP, PDM 1.0
- ECCO-TCP - 2,188 ECCO documents transcribed by TCP, PDM 1.0
- eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by eMOP, PDM 1.0
- Evans-TCP - 4,977 Evans documents transcribed by TCP
- FDHN - Finnish Digitised Historical Newspapers, Paper, (free) registration required, Terms of Use
- FROC-MSS - 4 Old French Medieval Manuscripts, CC-BY 4.0
- GERMANA - 764 Spanish manuscript pages, (free) registration required, non-commercial use only
- GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin, CC-BY 4.0
- imagessan - Sanskrit images & ground truth (Devanagari script)
- IMPACT-BHL - 2,418 pages (PAGE-XML) from the Biodiversity Heritage Library, XML@GitHub, CC-BY 3.0
- IMPACT-BL - 294 pages (PAGE-XML) from the British Library, (free) registration required, PDM 1.0
- IMPACT-BNE - 215 pages (PAGE-XML) from the National Library of Spain, (free) registration required, XML@GitHub, CC-BY-NC-SA 4.0
- IMPACT-BNF - 151 pages (PAGE-XML) from the National Library of France, (free) registration required, CC-BY-NC-SA 4.0
- IMPACT-KB - 142 pages (PAGE-XML) from the National Library of the Netherlands, CC-BY 4.0
- IMPACT-NKC - 187 pages (PAGE-XML) from the Czech National Library, (free) registration required, CC-BY-NC-SA 4.0
- IMPACT-NLB - 19 pages (PAGE-XML) from the National Library of Bulgaria, (free) registration required, CC-BY-NC-ND 4.0
- IMPACT-NUK - 209 pages (PAGE-XML) from the National Library of Slovenia, (free) registration required, CC-BY-NC-SA 4.0
- IMPACT-PSNC - 478 pages (PAGE-XML) from four Polish digital libraries, XML@GitHub, CC-BY 3.0
- LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
- MJSynth - 9m synthetic images covering 90k English words
- OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via Text+Berg digital, CC-BY 4.0
- OCR-D - 180 pages (PAGE-XML) of German historical prints from OCR-D, CC-BY-SA 4.0
- OCR_GS_Data - Double-checked Arabic Gold Standard from OpenITI
- old-books - 322 old books from Project Gutenberg, GPL 3.0
- PRImA-ENP - 528 pages (PAGE-XML) historic newspapers from Europeana Newspapers, (free) registration required, PDM 1.0
- RODRIGO - 853 Spanish manuscript pages, (free) registration required, non-commercial use only
- Toebler-OCR - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch
Literature
OCR-related publication and link lists
- IMPACT: Tools for text digitisation - List of tools and software projects, some related to OCR
- OCR-D - List of OCR-related academic articles in the context of the OCR-D project (in German).
- Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
- eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
- Wikipedia: Comparison of optical character recognition software
- OCR [and Deep Learning] by @handong1587
- Ocropus Wiki: Publications
Blog Posts and Tutorials
- Tesseract Blends Old and New OCR Technology (2016) @theraysmith
  - Tutorial@DAS2016, Updated "What You Always Wanted to Know" slides
- What You Always Wanted To Know About Tesseract (2014) @theraysmith
  - Tutorial@DAS2014, includes demos
- Extracting text from an image using Ocropus (2015)
- Training an Ocropus OCR model (2015) @danvk
- Ocropus Wiki: Compute errors and confusions (2016) @zuphilip
- Ocropus Wiki: Working with Ground Truth (2016) @zuphilip
- OCRopus (2016) @jze
  - mostly on column separation in ocropus
- 10 Tips for making your OCR project succeed (2013) @cneud
  - general things to consider for OCR projects
- Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
  - feature list for a commercial image pre-processing library; has nice before-after samples for pre-processing steps related to OCR
- Extracting Text from PDFs; Doing OCR; all within R @shawngraham
  - How to work with OCR from PDFs in the R programming environment
- Tutorial: Command-line OCR on a Mac @bmschmidt
  - Tutorial on how to run tesseract in Mac OSX
- Practical Experience with OCRopus Model Training (2016) @jze
- Homemade Manuscript OCR (1): OCRopy (2017) @Jean-Baptiste-Camps
  - Tutorial on applying OCR to medieval manuscripts with OCRopy
- Optimizing Binarization for OCRopus (2017) @jze
- Prototype demo for OCR postfix in Danish Newspapers (2016) @thomasegense
- How Can I OCR My Dictionary? (2016) @JessedeDoes
- "Needlessly complex" blog (2016) @mzucker. Several image processing how-tos (Python based), particularly:
- (Open-Source-)OCR-Workflows (2017) @wrznr (in German). Overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), lots of example images and information on the @OCR-D project.
- A gentle introduction to OCR (2018) @shgidi
- Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR (2019) @eliaskreyenbuehl (in German). A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts.
OCR Showcases
- abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
- cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
- MathOCR - A printed scientific document recognition system, pre-alpha
Academic articles
2011 and before
- High performance document layout analysis (2003) Breuel
- Adaptive degraded document image binarization (2006) Gatos, Pratikakis, Perantonis
- [Internship Report] (2007) Gupta
- OCRopus Addons (Internship Report) (2007) Dantrey
2012
- Local Logistic Classifiers for Large Scale Learning (2012) Yousefi, Breuel
2013
- High Performance OCR for Printed English and Fraktur using LSTM Networks (2013) Breuel, Ul-Hasan, Mayce Al Azawi, Shafait
- Can we build language-independent OCR using LSTM networks? (2013) Ul-Hasan, Breuel
- Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks (2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel
2014
- OCR of historical printings of Latin texts: Problems, Prospects, Progress. (2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink
- Correcting Noisy OCR: Context beats Confusion (2014) Evershed, Fitch
2015
- TypeWright: An Experiment in Participatory Curation (2015) Bilansky
  - On crowd-sourcing OCR postcorrection
- Benchmarking of LSTM Networks (2015) Breuel
- Recognition of Historical Greek Polytonic Scripts Using LSTM (2015) Simistira, Ul-Hasan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki
- A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015) Karayil, Ul-Hasan, Breuel
- A Sequence Learning Approach for Multiple Script Identification (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel
2016
- Important New Developments in Arabographic Optical Character Recognition (OCR) (2016) Romanov, Miller, Savant, Kiessling
  - on kraken
  - using OpenArabic/OCR_GS_Data for ground truth data
- OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus (2016) Springmann, Lüdeling
- Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents (2016) Springmann, Fink, Schulz
- Generic Text Recognition using Long Short-Term Memory Networks (2016) Ul-Hasan -- Ph.D Thesis
- OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters (2016) Dengel, Ul-Hasan, Bukhari
- Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016) Lee, Osindero
2017
- Telugu OCR Framework using Deep Learning (2015/2017) Achanta, Hastie
  - see also TeluguOCR, banti_telugu_ocr, chamanti_ocr, #49
2018
- A Two-Stage Method for Text Line Detection in Historical Documents (2018) Grüning, Leifert, Strauß, Labahn. Code available at https://github.com/TobiasGruening/ARU-Net