ocrd_all

ocrd resmgr: same comment for 2 Tesseract models

Open jbarth-ubhd opened this issue 11 months ago • 3 comments

  • frak2021.traineddata (https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata) Tesseract LSTM model based on Austrian National Library newspaper data

  • ONB.traineddata (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata) Tesseract LSTM model based on Austrian National Library newspaper data

jbarth-ubhd, Feb 28 '24 15:02

The 2nd comment is correct, but the 1st is incomplete: ONB was trained only on the ground truth from Austrian Newspapers, whereas frak2021 also used additional ground truth (GT4HistOCR and more).

In addition, frak2021 used a newer version of Austrian Newspapers, so the quality of the training data was better. Side note: german_print from 2023/2024 also uses a mix of ground truth data, but more of it, and newer, than frak2021.

I suggest updating the comment to "Tesseract LSTM model based on a mix of mostly German and Latin ground truth data".

stweil, Feb 28 '24 16:02

The fix has to be made in https://github.com/OCR-D/ocrd_tesserocr/blob/master/ocrd_tesserocr/ocrd-tool.json, so the issue should be moved to ocrd_tesserocr. @kba, I don't have the necessary rights to do that.
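For illustration only, a minimal Python sketch of how one might spot resource entries that share the same description. The list below is a hypothetical excerpt built from the two entries quoted in this issue; the `name`/`description` keys and the helper `find_shared_descriptions` are assumptions for the example, not the actual layout of ocrd-tool.json.

```python
from collections import defaultdict

# Hypothetical excerpt: names and descriptions copied from this issue.
# The key names are an assumption, not the confirmed ocrd-tool.json schema.
resources = [
    {
        "name": "frak2021.traineddata",
        "description": "Tesseract LSTM model based on Austrian National Library newspaper data",
    },
    {
        "name": "ONB.traineddata",
        "description": "Tesseract LSTM model based on Austrian National Library newspaper data",
    },
]


def find_shared_descriptions(entries):
    """Return {description: [names]} for descriptions used by more than one entry."""
    by_description = defaultdict(list)
    for entry in entries:
        by_description[entry["description"]].append(entry["name"])
    return {text: names for text, names in by_description.items() if len(names) > 1}


duplicates = find_shared_descriptions(resources)
for text, names in duplicates.items():
    print(f"{', '.join(names)} share the description: {text}")

# Applying the wording suggested above to frak2021 removes the duplicate.
fixed = [dict(entry) for entry in resources]
fixed[0]["description"] = (
    "Tesseract LSTM model based on a mix of mostly German and Latin ground truth data"
)
remaining = find_shared_descriptions(fixed)
```

With the suggested description applied to frak2021, the check no longer reports any shared description.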

stweil avatar Feb 28 '24 16:02 stweil