gImageReader
GUI to assist in fine tuning/teaching Tesseract on scanned images
It would be nice to have GUI elements that assist in fine-tuning/teaching Tesseract on scanned images, similar to what jTessBoxEditor does, as described in this article[^*]. This would mainly mean creating the .tiff and .box files...
[^*]: Not all the commands listed in the article worked for me. Here are my slightly corrected versions:
java -jar jTessBoxEditor.jar
tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox
# Create font_properties with one line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
echo "font 0 0 0 0 0" > font_properties
# Create a .tr file (training file)
tesseract font_name.font.exp0.tif font_name.font.exp0 nobatch box.train
# Create a unicharset file
unicharset_extractor font_name.font.exp0.box
# Create a shapetable file
shapeclustering -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr
# Create the pffmtable and inttemp files
mftraining -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr
# Create a normproto file
cntraining font_name.font.exp0.tr
mv shapetable font_name.shapetable
mv normproto font_name.normproto
mv pffmtable font_name.pffmtable
mv inttemp font_name.inttemp
combine_tessdata font_name.
Now copy font_name.traineddata to the tessdata directory:
sudo cp font_name.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
Now test the new traineddata:
tesseract test_numbers.png stdout -l font_name
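Since every file in the steps above is derived from the same `font_name.font.exp0` base name, the sequence can be generated rather than typed by hand. The following is only a sketch: `training_commands` and `run_all` are hypothetical helper names, the flags are copied unchanged from the commands above, and nothing here checks that the Tesseract training tools are actually installed.

```python
import shlex
import subprocess


def training_commands(lang="font_name", font="font", exp=0,
                      font_properties="font_properties"):
    """Return the training steps listed above as argv lists (hypothetical helper)."""
    base = f"{lang}.{font}.exp{exp}"   # e.g. font_name.font.exp0
    tr_file = f"{base}.tr"
    commands = [
        # Create the .tr training file from the .tif/.box pair
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # Create the unicharset file
        ["unicharset_extractor", f"{base}.box"],
        # Create the shapetable file
        ["shapeclustering", "-F", font_properties, "-U", "unicharset",
         "-O", f"{lang}.unicharset", tr_file],
        # Create the pffmtable and inttemp files
        ["mftraining", "-F", font_properties, "-U", "unicharset",
         "-O", f"{lang}.unicharset", tr_file],
        # Create the normproto file
        ["cntraining", tr_file],
    ]
    # Prefix each output file with the language name, as combine_tessdata expects
    for name in ("shapetable", "normproto", "pffmtable", "inttemp"):
        commands.append(["mv", name, f"{lang}.{name}"])
    commands.append(["combine_tessdata", f"{lang}."])
    return commands


def run_all(commands):
    """Run each step in order and stop at the first failure."""
    for argv in commands:
        print("+", shlex.join(argv))
        subprocess.run(argv, check=True)
```

Calling `run_all(training_commands())` in the directory that holds the `.tif`, `.box` and `font_properties` files would replay the steps; printing the result of `training_commands()` first is a cheap way to review them.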
Yes, this is one of the basic features necessary for an OCR program. If it gets added, I can donate to support development. Just make a simple GUI to modify the Tesseract configuration file, with a short description of each parameter on hover.
Probably the fastest way to achieve this is for someone to contribute the code via a PR. For my part, I won't have the capacity to work on this in the near future.
I created a simple Python script that extracts the boxes from the HTML file. In gImageReader you should export the edited image as HTML and then use the script to extract the boxes: https://github.com/khashashin/chechen_ocr
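For readers who want to see the idea without opening the repository, here is a minimal sketch of such an extraction. It is not the linked script. It assumes the HTML exported by gImageReader is hOCR, with an `ocr_page` element carrying the page `bbox` and `ocrx_word` spans carrying word boxes. hOCR measures from the top-left corner, while Tesseract box files measure from the bottom-left, so the y coordinates are flipped using the page height. `HocrWordParser` and `hocr_to_box_lines` are hypothetical names.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect word boxes from hOCR, converted to box-file coordinates."""

    def __init__(self):
        super().__init__()
        self.page_index = -1      # box files number pages from 0
        self.page_height = None   # taken from the ocr_page bbox
        self.boxes = []           # (text, left, bottom, right, top, page)
        self._bbox = None         # bbox of the word span currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if match is None:
            return
        box = tuple(int(v) for v in match.groups())
        if "ocr_page" in classes:
            self.page_index += 1
            self.page_height = box[3]
        elif "ocrx_word" in classes:
            if self.page_height is None:
                raise ValueError("ocrx_word found before any ocr_page bbox")
            self._bbox = box
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            x0, y0, x1, y1 = self._bbox
            if text:
                # Flip y: hOCR origin is top-left, box-file origin is bottom-left
                self.boxes.append((text, x0, self.page_height - y1,
                                   x1, self.page_height - y0, self.page_index))
            self._bbox = None


def hocr_to_box_lines(hocr_html):
    """Return box-file style lines: '<text> <left> <bottom> <right> <top> <page>'."""
    parser = HocrWordParser()
    parser.feed(hocr_html)
    parser.close()
    return [" ".join(str(field) for field in box) for box in parser.boxes]
```

On a page of height 100, a word `Hello` with `bbox 10 20 50 40` becomes the line `Hello 10 60 50 80 0`. Note that this produces one box per word, whereas `makebox` and jTessBoxEditor typically work with one box per character, so the output would still need splitting or manual correction before running the training steps above.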