OCR Produces Non-Existent Text from Bleed-Through Artifacts
I'm not sure if this is a true issue, but I thought it was worth investigating.
When processing scanned documents, the OCR model sometimes produces text that does not exist in the original image. Initially, this seemed like a hallucination issue, but after careful inspection, I noticed that the generated text (mostly numbers and random characters) corresponds to very faint artifacts from text appearing on the reverse side of the scanned page.
This issue seems to be caused by bleed-through, where slightly visible text from the back of the page is mistakenly recognized as actual foreground text. The issue gets worse as the DPI increases.
I think the model is too sensitive to bleed-through. This is no big deal, since the images can be preprocessed in such cases, or I could change the confidence threshold. Still, maybe the pipeline could do some of this preprocessing itself, or the OCR model could be trained to be less sensitive to such effects.
Thanks for the great project!
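As a rough illustration of the preprocessing workaround mentioned above, here is a minimal sketch of a levels adjustment that pushes faint, light-grey pixels (where bleed-through typically sits) to pure white. The function name `levels_lut` and the default thresholds are assumptions made for this example, not something Marker or Surya provide, and the thresholds would need tuning per scan.

```python
def levels_lut(black_point: int = 60, white_point: int = 180) -> list[int]:
    """Build a 256-entry lookup table for an 8-bit grayscale image.

    Hypothetical helper for illustration only. Values at or above
    white_point become pure white (255), values at or below black_point
    become pure black (0), and values in between are stretched linearly.
    Faint bleed-through that is lighter than white_point disappears.
    """
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")

    span = white_point - black_point
    lut = []
    for value in range(256):
        if value <= black_point:
            lut.append(0)
        elif value >= white_point:
            lut.append(255)
        else:
            # Linear stretch of the remaining range onto 0..255.
            lut.append(round((value - black_point) * 255 / span))
    return lut
```

With Pillow, a table like this could be applied to a grayscale page via `Image.point`, for example `page.convert("L").point(levels_lut())`, before handing the image to the OCR pipeline. If the bleed-through is as dark as some of the genuine text, a simple global threshold like this will not separate them.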
This is an example part of the page that produces nonsense:
Produces:
ανδρείκελο «ομοίωμα ανθρώπου»
< αρχ. άνδρείκελον (ήδη τον 5ο αι. π.Χ. σε Πλάτωνα και Ξενοφώντα] < άνδρ(ο)- + -είκελον < επίθ. είκελος «όμοιος», για το οποίο βλ.λ. εικόνα.
ανδρείος -> ανδρας
ανδριαντας
< apx. άνδριάς, - άντος (ήδη μυκ. a-di-ri-ja-pi: *άνδριαφι, οργανική πληθ.) < άνδρίον, υποκορ. τού ανήρ, άνδρος: 1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
11 16 21 11 -
大型 大发电影群 中国体育
MM #582KO X THE HATE
191:1 1818.
The state the complex of the results of the states
Color Colline
ανδρικός -> άνδρας
- ανδρισμός -> άνδρας -> « » » » » « « « « «
- ανδρώνω → άνδρας
ανε- στερητικό | | | | | | | | | | | | |
< μεσν. άνε-, που προέρχεται από επίθ. με άν- στερητ. όταν ακολουθούσε -ε- (π.χ. άν-έκδοτος, άνέλπιστος, άν-επίδεκτος), από όπου στη συνέχεια αυτονομήθηκε ως στερητ. μόρφημα (π.χ. μεσν. άνέγνωρος).
Thanks for pointing this out! This is an issue with the new text detection model from the latest surya release. The model seems to be a little too sensitive to bleed-through text. We'll patch this in the next release!
Still getting this on the latest release
@iansmirlis Could you please tell me how to examine the confidence score of particular characters or words that have been OCR'ed? Is that the confidence you were referring to? Thank you.
Will take a look again. I think we will need to add some filtering for these cases, and I'll put something into our next release. Sorry for the long wait! It gets tricky trying to balance this between model fixes and post-processing.
@tarun-menta Hi, no, I was asking about the general case, not this specific situation: how to get the OCR confidence for detected text. I've managed to extract it from Surya's OCR character confidence; only a few lines of code are needed in the Marker source code.
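For anyone who gets as far as having a confidence score per recognized line, the filtering idea discussed in this thread might look roughly like the sketch below. `RecognizedLine` and `drop_low_confidence` are hypothetical names invented for this example; they do not reproduce Surya's or Marker's actual data structures, and the cutoff of 0.5 is an arbitrary placeholder.

```python
from dataclasses import dataclass


@dataclass
class RecognizedLine:
    # Hypothetical container, NOT Surya's or Marker's real class.
    text: str
    confidence: float  # assumed to lie in [0, 1]


def drop_low_confidence(lines, min_confidence=0.5):
    """Keep only the lines whose confidence reaches the cutoff."""
    return [line for line in lines if line.confidence >= min_confidence]


if __name__ == "__main__":
    # Texts taken from the output above; the scores are made up
    # purely to illustrate the filter.
    sample = [
        RecognizedLine("ανδρικός -> άνδρας", 0.97),
        RecognizedLine("MM #582KO X THE HATE", 0.31),
    ]
    for line in drop_low_confidence(sample):
        print(line.text)
```

Whether such a filter helps depends on the model actually assigning low confidence to the bleed-through lines, which this thread does not confirm.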