[BUG] OCR for German language quite inaccurate

Open drp4positive opened this issue 7 months ago • 8 comments

Which app is your issue for

Document Scanner

Version

1.14.5 Build 121

What platform are you using?

Android

OS Version

GrapheneOS latest

What happened?

The OCR text extracted from scanned documents is not of "good" quality. I used the "best" OCR quality setting with "German" as the language. The resulting text is not as good as I am used to from other document scanner apps: there are unwanted spaces, misspelled words, stray special characters, wrong words, etc. This makes it hard to search for text within the PDF documents later on.

Maybe it is possible to add another OCR system or enhance the German version. Thank you very much.

Relevant log output


Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

drp4positive avatar Apr 30 '25 20:04 drp4positive

@drp4positive I am sorry to hear that. The app relies on Tesseract for OCR. To my knowledge it is the only OCR engine that is good enough across all languages, but it may indeed be worse for some. It should, though, be pretty much as good for German as for English or French. You can create an issue on their repo to see what they think of it.

farfromrefug avatar May 02 '25 14:05 farfromrefug

Thanks @farfromrefug. I used better light (daylight) to take the photos and the OCR result is better, but still not as good as I was used to from my old app. But I am happy to use a FOSS product, and I will take your suggestion into consideration and let the Tesseract people know.

Greetings

drp4positive avatar May 02 '25 19:05 drp4positive

The same situation applies to OCR of texts in Russian.

OSS Document Scanner 1.14.5.121 (2025-02-17): Image

ABBYY FineReader PDF 16.0.14.6564; part 1435.8: Image

PDF-XChange Editor 10.7.1, build 399 (Enhanced OCR ≡ ABBYY OCR 12): Image

I understand that Tesseract is pretty bad at recognizing non-English texts with imperfect letter outlines. But maybe it makes sense to think about using a cloud service API based on neural networks? These currently recognize even handwritten text very well, unlike traditional OCR engines, which require many hours of training for each handwriting and still recognize handwritten text poorly.

Korb avatar Sep 11 '25 12:09 Korb

@Korb @drp4positive I think this is because Cyrillic characters are not printed. You can try this build (GitHub/F-Droid build with Sentry enabled): https://github.com/Akylas/OSS-DocumentScanner/releases/tag/webdav_test Please report whether it is better.

farfromrefug avatar Sep 12 '25 12:09 farfromrefug

OCR settings in both cases:

Image

OSS Document Scanner 1.14.5.121 (2025-02-17): Image

OSS Document Scanner 1.14.5.121 (2025-09-12): Image

Clearly improved OCR quality!

Korb avatar Sep 12 '25 15:09 Korb

@Korb What different options have you used for the better result and for the not so good result? Thanks

drp4positive avatar Sep 12 '25 19:09 drp4positive

What different options have you used for the better result and for the not so good result?

OCR settings in both cases:
Quality: Best
Languages: Russian

Or do you mean desktop apps' settings?

Korb avatar Sep 13 '25 06:09 Korb

The improvements come from the updated Tesseract and from Cyrillic character rendering.

farfromrefug avatar Sep 13 '25 06:09 farfromrefug
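
For anyone who wants to compare two builds more objectively than by looking at screenshots, one common approach is to compute a character error rate (CER) between the OCR output and a hand-typed reference transcription of the same page. The following is only a minimal sketch under that assumption: the function names and the sample strings are hypothetical illustrations, not part of the app or of Tesseract, and the whitespace normalization is a simplification.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by reference length, after collapsing whitespace."""
    # Assumption: runs of whitespace and line breaks are treated as one space.
    ref = " ".join(reference.split())
    hyp = " ".join(ocr_output.split())
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the screenshots in this thread.
    reference = "Die Straße ist gesperrt."
    old_build = "Die Stra ße ist gesper rt."
    new_build = "Die Straße ist gesperrt."
    print("old build CER:", character_error_rate(reference, old_build))
    print("new build CER:", character_error_rate(reference, new_build))
```

A lower value means the OCR text is closer to the reference. Running the same photo through both builds with identical settings (Quality: Best, same language) keeps the comparison fair.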