ocrd_all
ocrd resmgr: same comment for two different Tesseract models
- frak2021.traineddata (https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata) Tesseract LSTM model based on Austrian National Library newspaper data
- ONB.traineddata (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata) Tesseract LSTM model based on Austrian National Library newspaper data
The second comment is correct, but the first is incomplete: ONB was trained only on the ground truth from Austrian Newspapers, whereas frak2021 also used additional ground truth (GT4HistOCR and more).
In addition, frak2021 used a newer version of Austrian Newspapers, so the quality of its training data was better. Side note: german_print from 2023/2024 is also trained on a mix of ground truth data, with even more and newer data than frak2021.
I suggest updating the frak2021 comment to "Tesseract LSTM model based on a mix of mostly German and Latin ground truth data".
The fix is required for https://github.com/OCR-D/ocrd_tesserocr/blob/master/ocrd_tesserocr/ocrd-tool.json. Therefore the issue should be moved to ocrd_tesserocr. @kba, I don't have the necessary rights to do that.
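To make the proposed change concrete, here is a minimal Python sketch of how the frak2021 description could be updated. It assumes that ocrd-tool.json lists resources per tool as entries with `name`, `url` and `description` keys; that layout, the `update_description` helper and the miniature `sample` document are illustrative assumptions, not the actual file contents.

```python
import json

# Wording proposed in this issue for the frak2021 entry.
NEW_DESCRIPTION = (
    "Tesseract LSTM model based on a mix of mostly German and Latin ground truth data"
)


def update_description(tool_json: dict, resource_name: str, description: str) -> int:
    """Set the description of every resource named `resource_name`.

    Assumes a layout of tools -> <tool name> -> resources -> [entries];
    returns the number of entries that were changed.
    """
    changed = 0
    for tool in tool_json.get("tools", {}).values():
        for resource in tool.get("resources", []):
            if resource.get("name") == resource_name:
                resource["description"] = description
                changed += 1
    return changed


# Miniature stand-in for ocrd-tool.json (assumed structure, not the real file).
sample = {
    "tools": {
        "ocrd-tesserocr-recognize": {
            "resources": [
                {
                    "name": "frak2021.traineddata",
                    "url": "https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata",
                    "description": "Tesseract LSTM model based on Austrian National Library newspaper data",
                },
                {
                    "name": "ONB.traineddata",
                    "url": "https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata",
                    "description": "Tesseract LSTM model based on Austrian National Library newspaper data",
                },
            ]
        }
    }
}

changed = update_description(sample, "frak2021.traineddata", NEW_DESCRIPTION)
print(json.dumps(sample, indent=2))
```

Only the frak2021 entry is touched; the ONB description stays as it is, since that comment is already correct.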