
OCR Produces Non-Existent Text from Bleed-Through Artifacts

Open iansmirlis opened this issue 8 months ago • 5 comments

Not sure if this is a true issue but I thought it's worth investigating.

When processing scanned documents, the OCR model sometimes produces text that does not exist in the original image. Initially, this seemed like a hallucination issue, but after careful inspection, I noticed that the generated text (mostly numbers and random characters) corresponds to very faint artifacts from text appearing on the reverse side of the scanned page.

This issue seems to be caused by bleed-through, where faintly visible text from the reverse side of the page is mistakenly recognized as foreground text. The problem gets worse as the scan DPI increases.

I think the model is too sensitive to bleed-through. This is no big deal, since in this case the images can be preprocessed or I could change the confidence threshold. But maybe some preprocessing could also be done by the pipeline, or the OCR model could be trained to be less sensitive to such effects.
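As a rough illustration of the preprocessing idea, here is a minimal sketch that whitens everything lighter than a cutoff, on the assumption that bleed-through is noticeably fainter than the real ink. The function name `suppress_bleed_through` and the `cutoff` value of 160 are hypothetical and hand-picked for the example. They are not part of Marker or Surya, and a real scan would likely need per-document tuning.

```python
import numpy as np

def suppress_bleed_through(gray: np.ndarray, cutoff: int = 160) -> np.ndarray:
    """Return a copy of `gray` with every pixel lighter than `cutoff` set to white.

    `gray` is an 8-bit grayscale page (0 = black ink, 255 = white paper).
    `cutoff` is an assumed, hand-tuned value, not a Marker or Surya setting.
    """
    if gray.dtype != np.uint8 or gray.ndim != 2:
        raise ValueError("expected a 2-D uint8 grayscale image")
    cleaned = gray.copy()
    # Faint bleed-through and paper texture sit above the cutoff; real ink stays below it.
    cleaned[cleaned > cutoff] = 255
    return cleaned

# Synthetic one-row "page": dark ink (30), faint bleed-through (200), paper (245).
page = np.array([[30, 200, 245]], dtype=np.uint8)
cleaned = suppress_bleed_through(page)
```

A scanned page could be loaded as grayscale with Pillow, for example `np.asarray(Image.open(path).convert("L"))`, cleaned, and saved again before being handed to the OCR pipeline. A fixed cutoff is the simplest option. It can also erase light genuine text, so it is worth checking the result visually.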

Thanks for the great project!


This is an example part of the page that produces nonsense:

Image

Produces:

ανδρείκελο «ομοίωμα ανθρώπου»

< αρχ. άνδρείκελον (ήδη τον 5ο αι. π.Χ. σε Πλάτωνα και Ξενοφώντα] < άνδρ(ο)- + -είκελον < επίθ. είκελος «όμοιος», για το οποίο βλ.λ. εικόνα.

ανδρείος -> ανδρας

ανδριαντας

< apx. άνδριάς, - άντος (ήδη μυκ. a-di-ri-ja-pi: *άνδριαφι, οργανική πληθ.) < άνδρίον, υποκορ. τού ανήρ, άνδρος: 1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

11 16 21 11 -

大型 大发电影群 中国体育

MM #582KO X THE HATE

191:1 1818.

The state the complex of the results of the states

Color Colline

ανδρικός -> άνδρας

  • ανδρισμός -> άνδρας -> « » » » » « « « « «
  • ανδρώνω → άνδρας

ανε- στερητικό | | | | | | | | | | | | |

< μεσν. άνε-, που προέρχεται από επίθ. με άν- στερητ. όταν ακολουθούσε -ε- (π.χ. άν-έκδοτος, άνέλπιστος, άν-επίδεκτος), από όπου στη συνέχεια αυτονομήθηκε ως στερητ. μόρφημα (π.χ. μεσν. άνέγνωρος).

iansmirlis avatar Mar 04 '25 21:03 iansmirlis

Thanks for pointing this out! This is an issue with the new text detection model from the latest surya release. The model seems to be a little too sensitive to bleed-through text. We'll patch this in the next release!

tarun-menta avatar Mar 05 '25 20:03 tarun-menta

Still getting this on the latest release

melyux avatar May 13 '25 08:05 melyux

@iansmirlis Could you please tell me how to examine the confidence score of particular characters/words that have been OCR'ed? Is that the confidence you referred to? Thank you

kiselaTruba avatar Jul 09 '25 14:07 kiselaTruba

Will take a look again. I think we will need to add some filtering for these cases, and I'll put something into our next release. Sorry for the long wait! It gets tricky trying to balance this between model fixes and post-processing.

tarun-menta avatar Jul 14 '25 16:07 tarun-menta

@tarun-menta Hi, no, I was asking about the general case, not this specific situation: how to get the OCR confidence for detected text. I've managed to extract it from the Surya OCR character confidence; just a few lines of code are needed in the Marker source code.
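For illustration only, once a confidence value is available per recognized line, filtering on it could look like the sketch below. The `OcrLine` class, the `drop_low_confidence` helper, the 0.8 threshold, and the confidence numbers are all hypothetical and invented for the example. They do not reflect Marker's or Surya's actual data structures or any measured scores.

```python
from dataclasses import dataclass

@dataclass
class OcrLine:
    # Hypothetical container: recognized text plus a confidence in [0, 1].
    text: str
    confidence: float

def drop_low_confidence(lines: list[OcrLine], min_confidence: float = 0.8) -> list[OcrLine]:
    """Keep only lines whose confidence is at least `min_confidence`."""
    return [line for line in lines if line.confidence >= min_confidence]

# Texts are taken from the example output in this issue; the scores are made up.
lines = [
    OcrLine("ανδρικός -> άνδρας", 0.97),
    OcrLine("11 16 21 11 -", 0.41),
    OcrLine("Color Colline", 0.55),
]
kept = drop_low_confidence(lines)
```

Whether bleed-through lines actually receive low confidence would need to be checked on real output before relying on a threshold like this.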

kiselaTruba avatar Jul 18 '25 10:07 kiselaTruba