doctr Duplicate words in OCR result

🐛 Bug

Running the sample code:

from doctr.documents import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# Image (the input here is a PNG, not a PDF)
doc = DocumentFile.from_images(["table.png"])
# Analyze
result = model(doc)

result.show(doc)

I get this result:

Everything looks fine, but some words overlap. The mouse is pointing to the word "Header4", and a second, overlapping word with the content "4" is detected as well. Because of that extra "4", I'm not able to properly reconstruct the table header.

To Reproduce

Steps to reproduce the behavior:

download this image

Run the following code

from doctr.documents import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# Image (the input here is a PNG, not a PDF)
doc = DocumentFile.from_images(["table.png"])
# Analyze
result = model(doc)

result.show(doc)

Jun 26 '21 10:06 jonathanMindee

I think it should be possible to filter out those duplicates with some overlap-detection post-processing.
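To make the idea concrete, here is a minimal sketch of such an overlap check. It assumes the nested dictionary returned by result.export() follows a pages / blocks / lines / words layout where each word has a "value", a "confidence" and a "geometry" of ((xmin, ymin), (xmax, ymax)) in relative coordinates. The helper names and the sample dictionary are hypothetical and hand-written for illustration, not real model output.

```python
from itertools import combinations


def flatten_words(export):
    """Collect (value, confidence, box) for every word of an exported result.

    Assumed layout: pages -> blocks -> lines -> words.
    """
    words = []
    for page in export["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    (xmin, ymin), (xmax, ymax) = word["geometry"]
                    words.append((word["value"], word["confidence"], (xmin, ymin, xmax, ymax)))
    return words


def intersection_area(a, b):
    """Area shared by two (xmin, ymin, xmax, ymax) boxes, 0.0 if they are disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def overlapping_pairs(words):
    """Return the text of every pair of words whose boxes share some area."""
    return [
        (w1[0], w2[0])
        for w1, w2 in combinations(words, 2)
        if intersection_area(w1[2], w2[2]) > 0
    ]


# Hand-written stand-in for result.export(), mimicking the "Header4" / "4" case.
sample_export = {"pages": [{"blocks": [{"lines": [{"words": [
    {"value": "Header3", "confidence": 0.99, "geometry": ((0.50, 0.10), (0.62, 0.14))},
    {"value": "Header4", "confidence": 0.98, "geometry": ((0.70, 0.10), (0.82, 0.14))},
    {"value": "4", "confidence": 0.90, "geometry": ((0.80, 0.10), (0.82, 0.14))},
]}]}]}]}

pairs = overlapping_pairs(flatten_words(sample_export))
print(pairs)
```

This only flags candidate pairs; deciding which box to keep or merge is a separate step.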

Jun 26 '21 10:06 jonathanMindee

Thanks for reporting this!

I'm not sure which way would be the best, but here are some ideas to handle this:

Batch post-processing: perform NMS with a looser threshold.
Manual post-processing: estimate candidate overlaps with a box IoU. For pairs where the text overlaps as well, we perform a manual NMS (keeping the longest string, provided its confidence is above a given threshold). The likely issue is that the resulting string would wrongly omit the blank space.
Training-based: we add the blank space to the recognition vocab and use NMS.

Since the first option is natively implemented in most modern DL frameworks, it might be a suitable one to try first.
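For reference, the manual post-processing idea could look roughly like the sketch below. It is not docTR code: words are assumed to be plain (value, confidence, box) tuples with boxes as (xmin, ymin, xmax, ymax), the function names are hypothetical, and "text overlap" is approximated by a substring test.

```python
def boxes_intersect(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes share some area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def drop_contained_duplicates(words, min_conf=0.5):
    """Manual NMS: among overlapping words with overlapping text, keep the longest.

    `words` is a list of (value, confidence, box) tuples. The shorter word is
    only dropped when the longer one has a confidence of at least `min_conf`.
    """
    dropped = set()
    for i, (val_i, conf_i, box_i) in enumerate(words):
        if i in dropped:
            continue
        for j, (val_j, _, box_j) in enumerate(words):
            if i == j or j in dropped:
                continue
            text_overlap = len(val_j) < len(val_i) and val_j in val_i
            if boxes_intersect(box_i, box_j) and text_overlap and conf_i >= min_conf:
                dropped.add(j)
    return [word for k, word in enumerate(words) if k not in dropped]


# Hand-written example values, not real model output.
words = [
    ("Header3", 0.99, (0.50, 0.10, 0.62, 0.14)),
    ("Header4", 0.98, (0.70, 0.10, 0.82, 0.14)),
    ("4", 0.90, (0.80, 0.10, 0.82, 0.14)),
]
kept = drop_contained_duplicates(words)
print([value for value, _, _ in kept])
```

As noted in the list above, this keeps one of the two strings as-is, so a blank space that should separate two genuinely different words would be lost.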

Jun 26 '21 12:06 fg-mindee

I think we shouldn't only perform NMS, because here for instance we want to keep both boxes when there is an overlap. I see 2 solutions:

Merging the 2 boxes into 1 box: it is quick and easy, but it can include undesirable spaces.
Arbitrarily shorten one of the 2 boxes to eliminate overlapping.

It is, however, an uncommon edge case; I think it only happens with underscores.
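A rough sketch of both solutions, assuming axis-aligned boxes given as (xmin, ymin, xmax, ymax) and two words that sit side by side on the same line with a partial horizontal overlap. The function names and coordinates are made up for illustration.

```python
def merge_boxes(a, b):
    """Solution 1: replace two boxes with the smallest box enclosing both."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def shorten_right_box(left, right):
    """Solution 2: move the left edge of `right` so it starts where `left` ends.

    Assumes a partial overlap; if `right` is fully inside `left`, the result
    would be degenerate and the box should be dropped instead.
    """
    new_xmin = max(right[0], left[2])
    return (new_xmin, right[1], right[2], right[3])


# Hand-written example: two halves of a word split around an underscore.
left = (0.10, 0.10, 0.30, 0.14)
right = (0.28, 0.10, 0.45, 0.14)

merged = merge_boxes(left, right)
shortened = shorten_right_box(left, right)
print(merged, shortened)
```

Merging keeps all the pixels but forces a single recognition crop; shortening keeps two words but picks the cut position arbitrarily.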

Jun 28 '21 08:06 charlesmindee

As a matter of fact, we do want to suppress very small boxes included in other ones, so I suggest the following:

performing NMS with a very high threshold (let's say > 80%) to filter out boxes covered by other ones (avoiding repetitions without losing information).
merging boxes with a significant overlap but a lower IoU (for instance, an IoU between 20% and 80%), to keep all the information we need.

This overlap seems to be most frequent with underscores, so I think it is a good approximation to merge boxes in that case (technically, it is the same word). What do you think @fg-mindee ?
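Here is a minimal, single-pass sketch of that two-threshold scheme, using the 20% and 80% values from the suggestion as defaults. Boxes are assumed to be (xmin, ymin, xmax, ymax) tuples and the function names are hypothetical. A merged box is not re-checked against the boxes already kept, so a real implementation would probably need to iterate until nothing changes.

```python
def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def resolve_overlaps(boxes, merge_thresh=0.2, suppress_thresh=0.8):
    """Drop near-duplicates (IoU > suppress_thresh), merge mid-range overlaps."""
    kept = []
    # Largest boxes first, so that the covering box is the one that survives.
    for box in sorted(boxes, key=box_area, reverse=True):
        for idx, ref in enumerate(kept):
            score = iou(box, ref)
            if score > suppress_thresh:
                break  # almost the same box: suppress it
            if score >= merge_thresh:
                # Consistent overlap: replace `ref` with the enclosing box.
                kept[idx] = (min(ref[0], box[0]), min(ref[1], box[1]),
                             max(ref[2], box[2]), max(ref[3], box[3]))
                break
        else:
            kept.append(box)  # no significant overlap with anything kept so far
    return kept


# Hand-written boxes: a near-duplicate, a partial overlap and an isolated box.
boxes = [(0, 0, 10, 2), (0, 0, 10, 1.9), (6, 0, 14, 2), (20, 0, 25, 2)]
resolved = resolve_overlaps(boxes)
print(resolved)
```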

ex1 png

Jun 28 '21 10:06 charlesmindee

@charlesmindee Thanks for the suggestion! However, when I suggested an NMS, I was thinking about the iterative merging implementation of it. So I fully agree that pure filtering won't be enough. As you mentioned, we might need to use another metric than IoU :+1:
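To illustrate why IoU alone may not be the right metric, the sketch below compares it with one possible alternative, the intersection divided by the area of the smaller box. That alternative is only an assumption for illustration, not something agreed on in this thread, and the coordinates are hand-written to mimic the "Header4" / "4" case.

```python
def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def intersection_area(a, b):
    width = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    height = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    return width * height


def iou(a, b):
    """Intersection over union: small when one box is much larger than the other."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def coverage_of_smaller(a, b):
    """Intersection over the smaller box: 1.0 when one box lies inside the other."""
    smaller = min(box_area(a), box_area(b))
    return intersection_area(a, b) / smaller if smaller > 0 else 0.0


header4 = (0.70, 0.10, 0.82, 0.14)  # the full word
digit4 = (0.80, 0.10, 0.82, 0.14)   # the duplicate "4", inside the word box

print(f"IoU: {iou(header4, digit4):.3f}")
print(f"coverage of smaller box: {coverage_of_smaller(header4, digit4):.3f}")
```

With these values the small box is entirely covered while the IoU stays low, which is why an IoU threshold high enough to be safe would miss this kind of duplicate.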

Jun 28 '21 17:06 fg-mindee

Coming back to this issue, I suggest the following:

Investigate the heatmap of the text detection module to assess whether this comes from the segmentation or the box conversion part (I'm especially interested in the overlapping localization candidates shown on the issue description image).
Discuss options to handle the situation depending on our findings.
As shown earlier, NMS isn't really the best option here since we're talking about small IoU overlaps: if we tweak the NMS to catch them, it will start merging words that are correctly separated by a blank space.

But let's not leave this issue unaddressed :smiley:

Dec 10 '21 13:12 fg-mindee

@frgfm @charlesmindee @odulcy-mindee

This seems to be solved with preserve_aspect_ratio=True (the TF and PT backends behave identically). I have tested some personal documents, and keeping the aspect ratio was always the better choice. Should we use it by default, wdyt?

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True, preserve_aspect_ratio=True)
# Image (the input here is a PNG, not a PDF)
doc = DocumentFile.from_images(["/home/felix/Desktop/table.png"])
# Analyze
result = model(doc)

result.show(doc)

Screenshot from 2023-07-25 08-15-10

Jul 25 '23 06:07 felixT2K

Hi @felixdittrich92, thanks for the suggestion. I think we can change the default behaviour, since it is quite natural to preserve the aspect ratio by default. Moreover, it will make the predictions more robust to cropping.

Jul 31 '23 12:07 charlesmindee
