Training Tesseract OCR for a specific document

Open mumarsyal opened this issue 1 year ago • 4 comments

I have recently started learning and experimenting with Tesseract OCR. I have already trained a model for a new font using tesstrain.

My use case now is to train Tesseract 5 for a specific document, attached below.

Ptcl_bill_0000

I have found some articles and tutorials about training for a new font or a new language, but I couldn't find anything about training for a custom document.

Is it possible to train Tesseract 5 for my document? If yes, please give me some guidelines on how to proceed, and let me know whether I need any tools other than Tesseract itself to prepare the training data.

I have Tesseract 5 installed on Ubuntu 22.04.

mumarsyal avatar Nov 07 '23 11:11 mumarsyal

Could you please elaborate on what you are trying to achieve by training a specific document (type)? What do you expect to change compared to using the existing models?

stefan6419846 avatar Nov 07 '23 11:11 stefan6419846

Thank you for your response, @stefan6419846.

I ran the default Tesseract English model on this image and the output is very bad. So I want to train Tesseract specifically for this document to improve the output, but I don't know how to generate the training dataset (line images, *.gt.txt and box files) from these images. If you could suggest some tools for creating the dataset from these images, that would be wonderful.

mumarsyal avatar Nov 07 '23 11:11 mumarsyal

I have not tried it, but I would argue that better preprocessing on your side (feeding Tesseract specific ROIs, with appropriate preprocessing per ROI, instead of the whole page, ...) might be easier and sufficient.
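A minimal sketch of what per-ROI OCR could look like, assuming Pillow and pytesseract are installed. The `Roi` class, the helper names, the preprocessing steps (grayscale, autocontrast, 2x upscaling) and the example coordinates are all illustrative assumptions, not something taken from this document; the real regions would have to be measured on the bill itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roi:
    """A named region given as fractions of the page (0.0 to 1.0)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float
    psm: int = 6  # page segmentation mode: 6 = uniform block, 7 = single line


def to_pixel_box(roi, page_width, page_height):
    """Convert a fractional ROI into a (left, upper, right, lower) pixel box."""
    return (
        round(roi.left * page_width),
        round(roi.top * page_height),
        round(roi.right * page_width),
        round(roi.bottom * page_height),
    )


def ocr_rois(image_path, rois, lang="eng", scale=2):
    """Crop each ROI, preprocess it, and run Tesseract on that crop alone."""
    # Third-party imports are kept local so the geometry helper works without them.
    from PIL import Image, ImageOps
    import pytesseract

    results = {}
    with Image.open(image_path) as page:
        for roi in rois:
            crop = page.crop(to_pixel_box(roi, page.width, page.height))
            # Assumed preprocessing: grayscale, stretch contrast, upscale.
            crop = ImageOps.autocontrast(ImageOps.grayscale(crop))
            crop = crop.resize((crop.width * scale, crop.height * scale), Image.LANCZOS)
            results[roi.name] = pytesseract.image_to_string(
                crop, lang=lang, config=f"--psm {roi.psm}"
            ).strip()
    return results


# Hypothetical regions; the real coordinates depend on the document layout.
EXAMPLE_ROIS = [
    Roi("account_number", 0.60, 0.10, 0.95, 0.14, psm=7),
    Roi("charges_table", 0.05, 0.40, 0.95, 0.70, psm=6),
]
```

Something like `ocr_rois("Ptcl_bill_0000.png", EXAMPLE_ROIS)` would then return one string per region. Using fractional coordinates keeps the regions valid if the same form is scanned at a different resolution.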

stefan6419846 avatar Nov 08 '23 07:11 stefan6419846

Hello, maybe you can use jTessBoxEditor, but it is a heavy workload.
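For the line-image route, tesstrain expects each line image (.png or .tif) to sit next to a single-line transcription with the same base name and a .gt.txt extension. Here is a rough sketch of that pairing step, assuming the page has already been cut into line images and the text has been typed up by hand. The function name `write_ground_truth` and the file names are hypothetical.

```python
from pathlib import Path

# Only plain .png and .tif names are handled in this sketch.
IMAGE_SUFFIXES = (".png", ".tif")


def write_ground_truth(line_dir, transcriptions):
    """Write one .gt.txt per line image; return names of images with no text.

    `transcriptions` maps an image file name to its manually typed text.
    """
    line_dir = Path(line_dir)
    missing = []
    for image in sorted(line_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = transcriptions.get(image.name)
        if text is None:
            missing.append(image.name)
            continue
        # tesstrain expects a single line of plain text per image,
        # so any line breaks or repeated spaces are collapsed.
        single_line = " ".join(text.split())
        image.with_suffix(".gt.txt").write_text(single_line + "\n", encoding="utf-8")
    return missing
```

With the pairs placed in `data/ptcl_bill-ground-truth/` (the model name `ptcl_bill` is made up here), fine-tuning from the English model would be started with something like `make training MODEL_NAME=ptcl_bill START_MODEL=eng`.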

linxyu1 avatar Nov 15 '23 02:11 linxyu1