KabalaOCR
KabalaOCR copied to clipboard
Kabala Project OCR Training helpers
KabalaOCR
Tesseract langauge file for SuperMarket receipts :)
Setup a Tesseract environment
Ubuntu
You will have to build the latest version of tesseract from source (version 3), since the aptitude repositories hold an old version (v2.04).
- Download tesseract-3.01.tar.gz from http://code.google.com/p/tesseract-ocr/downloads/list
- Unpack it. (
tar -zvxf tesseract-3.01.tar.gz) - Before proceeding to the next step, make sure you have the following packages:
sudo apt-get install autoconf automake libtoolsudo apt-get install libpng12-devsudo apt-get install libjpeg62-devsudo apt-get install libtiff4-devsudo apt-get install zlib1g-devsudo apt-get install libleptonica-dev
- Compile tesseract:
./autogen.sh./configureNOTE: In case you will get an error about leptonica, you will have to do the following:sudo apt-get remove libleptonica-devwget http://www.leptonica.com/source/leptonica-1.68.tar.gz- After installing leptonica, try running
./configurefor tesseract again.
makesudo make installsudo ldconfig
- Download the english language file from http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.01.eng.tar.gz and unpack it into /usr/local/share/tessdata
Mac OSX
Oddly enough, homebrew hosts a newer version of tesseract (v3.0), which is good enough for us.
brew install tesseract
Interface & Commands
The workflow is divided into two phases, the first one is manual and the second we automate:
-
Phase 1
- Create an image of your receipt
- Convert the image to a b&w tif file
- Save the image under the receipts directory in its proper place
-
Phase 2
- Generate a box file by running:
node create_box_file.js tif_file_name - Fix the boxing results
- Run:
coffee runner_command.coffee [reciepts dir path]
- Generate a box file by running: