OSS-DocumentScanner
[BUG] OCR for German language quite inaccurate
Which app is your issue for
Document Scanner
Version
1.14.5 Build 121
What platform are you using?
Android
OS Version
GrapheneOS latest
What happened?
The OCR text extracted from scanned documents is of poor quality. I used the "best" OCR version with "German" as the language. The resulting text is not as good as what I am used to from other document scanner apps: there are unwanted spaces, misspelled words, stray special characters, wrong words, etc. This makes it hard to search for text within the PDF documents later on.
Maybe it is possible to add another OCR system or enhance the German version. Thank you very much.
Relevant log output
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
@drp4positive I am sorry to hear that. The app relies on Tesseract for OCR. To my knowledge, it is the only OCR engine that is good enough across all languages, though it may indeed be worse for some. Still, German should be pretty much as good as English or French. You can create an issue on their repo to see what they think of it.
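For anyone who wants to reproduce the problem outside the app before reporting it to the Tesseract project, here is a minimal sketch of calling the Tesseract command-line tool directly with the German model. It assumes the `tesseract` binary and the `deu` language data are installed locally; the file names `scan.png` and `out` are hypothetical, and the app itself may configure Tesseract differently.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="deu", psm=3):
    """Return the argv list for a Tesseract CLI run.

    -l selects the language model ("deu" is German, "rus" is Russian);
    --psm sets the page segmentation mode (3 = fully automatic, the default).
    """
    return ["tesseract", str(image_path), str(output_base),
            "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, output_base, lang="deu"):
    """Run Tesseract if it is installed; return the recognized text or None."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract CLI is not available on this machine
    subprocess.run(build_tesseract_command(image_path, output_base, lang=lang),
                   check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scan to try it.
    if Path("scan.png").exists():
        print(run_ocr("scan.png", "out", lang="deu"))
```

Attaching the same image and the text produced this way to an upstream report would help show whether the inaccuracy comes from Tesseract itself or from how the photo was captured.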
Thanks @farfromrefug. I took the photos in better light (daylight) and the OCR result improved, though it is still not as good as what I was used to from my old app. But I am happy to use a FOSS product, and I will take your suggestion into consideration and let the Tesseract people know.
Greetings
The same situation occurs with OCR of Russian texts:
OSS Document Scanner 1.14.5.121 (2025-02-17):
ABBYY FineReader PDF 16.0.14.6564; part 1435.8:
PDF-XChange Editor 10.7.1, build 399 (Enhanced OCR ≡ ABBYY OCR 12):
I understand that Tesseract is pretty bad at recognizing non-English texts with imperfect letter outlines. But maybe it makes sense to consider using a cloud service API based on neural networks? Such services currently recognize even handwritten text very well, unlike traditional OCR engines, which require many hours of training for each handwriting style and still recognize handwritten text poorly.
@Korb @drp4positive I think this is because Cyrillic characters are not printed. You can try this build (GitHub/F-Droid build with Sentry enabled): https://github.com/Akylas/OSS-DocumentScanner/releases/tag/webdav_test Please report whether it is better.
OCR settings in both cases:
OSS Document Scanner 1.14.5.121 (2025-02-17):
OSS Document Scanner 1.14.5.121 (2025-09-12):
Clearly improved OCR quality!
@Korb Which options did you use for the better result, and which for the not-so-good result? Thanks
> What different options have you used for the better result and for the not so good result?
OCR settings in both cases: Quality: Best; Languages: Russian
Or do you mean desktop apps' settings?
The improvements come from the updated Tesseract and from the Cyrillic character rendering.