doctr
[models] Pretrained artefact detection model isn't that robust
The current pretrained artefact detection model was trained on a fully synthetic dataset. While this comes with several advantages, the dataset has a distribution that is still a bit far off from real-world data.
This is almost a product question, but users can run the model on:
- source PDF/documents
- scanned documents
- pictures of documents
While the model performs quite well on the first two use cases, the last one sometimes has precision troubles (especially with the background).
To tackle this, perhaps we should improve the dataset or add augmentations that fill in a background instead of zero padding for geometric transforms (rotation, perspective, etc.)
What do you think @SiddhantBahuguna?
cc @fharper
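To make the background-augmentation idea above concrete, here is a minimal NumPy-only sketch, not doctr code. The function name `rotate_with_background` and the nearest-neighbour resampling are assumptions made for illustration. It rotates a page image and fills the pixels the rotation leaves uncovered from a background image instead of zeros.

```python
import numpy as np


def rotate_with_background(img, background, angle_deg):
    """Rotate `img` (H, W, C) about its centre and fill uncovered pixels
    from `background` (same shape) instead of zero padding.

    Hypothetical helper for illustration; uses nearest-neighbour sampling.
    Returns the augmented image and a boolean mask of pixels taken from `img`.
    """
    if img.shape != background.shape:
        raise ValueError("img and background must have the same shape")

    h, w = img.shape[:2]
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0

    # Inverse mapping: for every output pixel, find the source pixel it comes from
    ys, xs = np.mgrid[0:h, 0:w]
    x_rel, y_rel = xs - cx, ys - cy
    src_x = np.rint(cos_t * x_rel + sin_t * y_rel + cx).astype(int)
    src_y = np.rint(-sin_t * x_rel + cos_t * y_rel + cy).astype(int)

    # Output pixels whose source falls inside the original image
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)

    # Start from the background, then paste the rotated document on top
    out = background.copy()
    out[valid] = img[src_y[valid], src_x[valid]]
    return out, valid


# Usage: a white "page" rotated by 45 degrees over a dark grey background
page = np.full((32, 32, 3), 255, dtype=np.uint8)
bg = np.full((32, 32, 3), 7, dtype=np.uint8)
augmented, mask = rotate_with_background(page, bg, 45)
```

In a real pipeline the artefact boxes would have to go through the same transform, and the background would be sampled from photos of desks, tables, and so on rather than a constant colour.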
Thanks for the issue FG :) I agree. More data for further fine-tuning will greatly help :) One more thing: maybe we can also set some geometric restrictions (aspect ratio and relative area, of the logo in particular, with respect to the entire page). I had implemented that along with post-NMS and it did improve precision (sorry, I don't remember the exact performance increase). So, as a starting point, I suggest we build a dataset of around 2k more images for fine-tuning? That dataset may not be accessible to the public because of restrictions, though. In the meantime, I will work on the padding :) Will update you on this in 1~2 weeks. Thanks!
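For reference, the geometric restrictions mentioned above could look roughly like the sketch below. It is not the implementation referred to in the comment: the function name `filter_boxes`, the `(xmin, ymin, xmax, ymax)` box format, and all threshold values are assumptions chosen for illustration. It would run on the boxes that survive NMS.

```python
import numpy as np


def filter_boxes(boxes, page_shape, min_ar=0.2, max_ar=5.0,
                 min_rel_area=1e-3, max_rel_area=0.25):
    """Drop detections with an implausible aspect ratio or relative area.

    boxes: array-like of (xmin, ymin, xmax, ymax) in pixels (assumed format).
    page_shape: (height, width) of the page in pixels.
    All thresholds are illustrative defaults, not tuned values.
    Returns the kept boxes and the boolean keep mask.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    page_h, page_w = page_shape

    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    non_degenerate = (widths > 0) & (heights > 0)

    # Aspect ratio = width / height, guarded against zero-height boxes
    aspect = np.divide(widths, heights,
                       out=np.zeros_like(widths), where=heights > 0)
    # Fraction of the page covered by each box
    rel_area = widths * heights / float(page_h * page_w)

    keep = (
        non_degenerate
        & (aspect >= min_ar) & (aspect <= max_ar)
        & (rel_area >= min_rel_area) & (rel_area <= max_rel_area)
    )
    return boxes[keep], keep


# Usage on a 1000 x 800 page
candidates = [
    [100, 100, 300, 200],  # plausible logo
    [0, 0, 800, 1000],     # covers the whole page
    [10, 10, 12, 500],     # extremely thin sliver
    [50, 50, 50, 80],      # zero width
]
kept, keep_mask = filter_boxes(candidates, page_shape=(1000, 800))
```

Separate thresholds per artefact class (logo, QR code, and so on) would probably be needed, since their typical shapes differ.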
@SiddhantBahuguna any update? :)
On user end with contrib module now