doctr [models] Pretrained artefact detection model isn't that robust

Open fg-mindee opened this issue 2 years ago • 2 comments

The current pretrained artefact detection model was trained on a fully synthetic dataset. While this comes with several advantages, the dataset has a distribution that is still a bit far off from real-world data.

This is almost a product question, but users can run this model on any of:

source PDF/documents
scanned documents
pictures of documents

While the model performs quite well on the first two use cases, the last one sometimes has precision issues (especially with the background).

To tackle this, perhaps we should improve the dataset or add augmentations that add backgrounds instead of zero padding for geometric transforms (rotation, perspective, etc.).
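
To make the augmentation idea concrete, here is a minimal, dependency-free sketch of a rotation that fills the uncovered corners from a background image rather than with zeros. It is not doctr code: the function name `rotate_onto_background`, the nested-list grayscale image format, and the nearest-neighbour sampling are all assumptions made for illustration. A real pipeline would more likely use NumPy or OpenCV and real background photos.

```python
import math
import random


def rotate_onto_background(image, background, angle_deg):
    """Rotate a grayscale image about its centre, filling the uncovered
    corners with pixels from `background` instead of zeros.

    Hypothetical helper: `image` and `background` are equally sized lists
    of rows of ints. Nearest-neighbour sampling keeps it dependency-free.
    """
    height, width = len(image), len(image[0])
    if len(background) != height or len(background[0]) != width:
        raise ValueError("background must match the image size")

    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))

    output = []
    for y in range(height):
        row = []
        for x in range(width):
            # Inverse mapping: find the source pixel that lands on (x, y).
            dx, dy = x - cx, y - cy
            src_x = round(cos_a * dx + sin_a * dy + cx)
            src_y = round(-sin_a * dx + cos_a * dy + cy)
            if 0 <= src_x < width and 0 <= src_y < height:
                row.append(image[src_y][src_x])
            else:
                # Zero padding would write 0 here; use the background instead.
                row.append(background[y][x])
        output.append(row)
    return output


# Toy example: a uniform "page" (value 255) over a random textured background.
rng = random.Random(0)
page = [[255] * 8 for _ in range(8)]
texture = [[rng.randint(1, 200) for _ in range(8)] for _ in range(8)]
augmented = rotate_onto_background(page, texture, angle_deg=30)
```

The same masking idea would carry over to perspective transforms: wherever the warp has no source pixel, take the pixel from the background instead of the padding value.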

What do you think @SiddhantBahuguna?

cc @fharper

Mar 17 '22 16:03 fg-mindee

Thanks for the issue FG :) I agree. More data for further fine-tuning would greatly help :) One more thing: maybe we can also set some geometric restrictions (aspect ratio and relative area, of the logo in particular, with respect to the entire page). I had implemented that along with post-NMS and it did improve precision (I am sorry, I don't remember the exact perf increase). So, as a starting point, I suggest we gather a dataset of around 2k more images for fine-tuning? That dataset may not be publicly accessible because of restrictions, though. In the meantime, I will work on the padding :) I will update you on this in 1~2 weeks. Thanks!
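
For illustration, one possible reading of the geometric restrictions combined with NMS is sketched below. This is not the implementation mentioned above: the function names, the `(x_min, y_min, x_max, y_max)` box format in absolute pixels, and the threshold values (`max_rel_area`, `aspect_range`, `iou_threshold`) are all assumptions that would need tuning on real data.

```python
def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; 0 if it is degenerate."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(a, b):
    """Intersection over union of two boxes."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, page_size, max_rel_area=0.25,
                      aspect_range=(0.2, 5.0), iou_threshold=0.5):
    """Drop geometrically implausible boxes, then apply greedy NMS.

    `detections` is a list of (box, score) pairs; all thresholds are
    illustrative assumptions, not values from the thread.
    """
    page_w, page_h = page_size
    page_area = page_w * page_h

    plausible = []
    for box, score in detections:
        width, height = box[2] - box[0], box[3] - box[1]
        if width <= 0 or height <= 0:
            continue
        # Relative area: an artefact should not cover most of the page.
        if box_area(box) / page_area > max_rel_area:
            continue
        # Aspect ratio: reject extremely thin or flat boxes.
        if not aspect_range[0] <= width / height <= aspect_range[1]:
            continue
        plausible.append((box, score))

    # Greedy NMS: keep the best-scoring box, drop others that overlap it.
    plausible.sort(key=lambda det: det[1], reverse=True)
    kept = []
    for box, score in plausible:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((50, 50, 150, 100), 0.9),   # plausible logo
    ((55, 52, 152, 102), 0.6),   # near-duplicate of the first box
    ((0, 0, 1000, 1400), 0.8),   # covers the whole page
    ((10, 500, 990, 510), 0.7),  # very thin strip
]
kept = filter_detections(detections, page_size=(1000, 1400))
```

In this toy run only the first box survives: the whole-page box fails the relative-area check, the thin strip fails the aspect-ratio check, and the near-duplicate is suppressed by NMS.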

Mar 17 '22 16:03 SiddhantBahuguna

@SiddhantBahuguna any update ? :)

May 24 '22 18:05 felixdittrich92

This is now on the user's end with the contrib module.

Apr 25 '24 16:04 felixdittrich92
