tesstrain
Is it possible to train a model for multiple types of fonts?
I would like to know how the default model was trained: was it trained on many real images (and if so, what order of magnitude), or were the images generated automatically with different fonts?
I want to train my own model starting from the default one. With real images, it seems that the default model still reads low-resolution images better, even as I keep adding more data. I'm training with images of varying DPI that contain characters, words and phrases. Should I be doing it differently?
We don't know exactly how the standard models were trained because that was done by Google. Only some hints are available.
But have you ever trained a model on a dataset of images, or do you know of any case, where the accuracy ended up greater than or equal to that of the standard model? This type of information is very scarce. I would like a rough guideline for how large a dataset needs to be to get a reasonably functional model.
Yes, we have trained lots of models in the meantime. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for examples.
Wow! How have I not seen this before?
But I still have some doubts:
1 - I saw that you use XML in the dataset. Is this XML only used to extract the words and save them as png and .gt.txt files, or is the XML used together with the whole image?
2 - What is the order of magnitude of the dataset that you guys usually use (100k, 1M, 10M)?
3 - Do you do a lot of data augmentation to improve reading?
- The lines must be extracted from the PAGE XML files, and the same must be done for the page images. See example with extracted lines. For other GT data you still have to do this extraction.
- That depends. reichsanzeiger-gt for example has 119435 lines, GT4HistOCR has 313173 lines, but there are also some smaller data sets.
- No data augmentation.
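To illustrate the extraction step mentioned in the first answer, here is a minimal sketch, not the script actually used for those datasets. It assumes each PAGE XML `TextLine` carries a `Coords` element with a `points` attribute and a line-level `TextEquiv/Unicode` child. The function names, the file naming scheme and the sample document are hypothetical. Cropping the line images from the page image is left out because it needs an image library.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical, hand-written PAGE XML snippet used only for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page1.png" imageWidth="200" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="10,20 110,20 110,50 10,50"/>
        <Word id="w1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Strip the XML namespace so any PAGE schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(page_xml):
    """Return (line_id, bounding_box, text) for every TextLine with text."""
    root = ET.fromstring(page_xml)
    lines = []
    for el in root.iter():
        if local(el.tag) != "TextLine":
            continue
        bbox, text = None, None
        # Only look at direct children, so word-level TextEquiv is ignored.
        for child in el:
            name = local(child.tag)
            if name == "Coords":
                pts = [tuple(int(v) for v in p.split(","))
                       for p in child.get("points", "").split()]
                if pts:
                    xs, ys = zip(*pts)
                    bbox = (min(xs), min(ys), max(xs), max(ys))
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        text = sub.text or ""
        if bbox is not None and text:
            lines.append((el.get("id"), bbox, text))
    return lines


def write_ground_truth(lines, out_dir, prefix):
    """Write one <prefix>_<line_id>.gt.txt file per line; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for line_id, _bbox, text in lines:
        path = out / f"{prefix}_{line_id}.gt.txt"
        path.write_text(text + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Each bounding box returned by `extract_lines(SAMPLE)` could then be used to crop the page image and save the result next to the `.gt.txt` file under the same base name, so that every line image has its matching transcription.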
The last question, I swear.
Does training a model with multiple fonts have much impact on accuracy? I mean several images with different fonts, always keeping the proportion between them balanced, of course.