
Clean/remove duplicate images with `fastdup`

Open mrdbourke opened this issue 2 years ago • 3 comments

Make a script to clean and remove duplicate images with fastdup - https://github.com/visual-layer/fastdup

  • This should scale well: they tested it across ImageNet21k (millions of images) and it finished in ~3 hours
  • Could run this script periodically to clean images whenever new images are downloaded
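
A rough sketch of what such a cleaning script could look like. This is not fastdup: it uses only the Python standard library and only catches byte-identical files, whereas fastdup also finds near-duplicates. The function names, the `IMAGE_SUFFIXES` set, and the "keep the first file in each group" rule are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumption: adjust to whatever formats the dataset actually contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(image_dir):
    """Group image files by content hash and return groups with 2+ files."""
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def remove_duplicates(duplicate_groups, dry_run=True):
    """Keep the first file of each group and list the rest for removal.

    Files are only deleted when dry_run is False.
    """
    removed = []
    for paths in duplicate_groups:
        for path in paths[1:]:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Running with `dry_run=True` first gives a list to spot-check before anything is deleted.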

mrdbourke avatar Jan 23 '23 21:01 mrdbourke

Did this with a notebook and removed 695/25000 (or thereabouts) images. Saw a slight reduction in performance, but this was expected since there is now less data leakage between the train & test sets. See the evaluation run: https://wandb.ai/mrdbourke/test_wandb_artifacts_by_reference/runs/714m0crl
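
For reference, the kind of leakage meant here can be checked directly. The sketch below is an assumption-heavy illustration, not what the notebook does: it hashes file contents with the standard library and reports files that appear byte-for-byte in both splits. `hash_files`, `find_train_test_leaks`, and the directory arguments are hypothetical names, and near-duplicates (resized or re-encoded copies) would not be caught.

```python
import hashlib
from pathlib import Path


def hash_files(directory):
    """Map SHA-256 content digest -> list of file paths under directory."""
    digests = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests.setdefault(digest, []).append(path)
    return digests


def find_train_test_leaks(train_dir, test_dir):
    """Return (train_paths, test_paths) pairs with identical file contents."""
    train = hash_files(train_dir)
    test = hash_files(test_dir)
    shared = sorted(train.keys() & test.keys())
    return [(train[d], test[d]) for d in shared]
```

An empty result only rules out exact copies across the splits.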

mrdbourke avatar Jan 24 '23 03:01 mrdbourke

Original notes (from #50) -

  • Found a library to help with image deduplication via hashing: https://github.com/idealo/imagededup. Removing duplicates will help make the model more robust and prevent data from leaking from train → test set (which would give false metrics)
  • Created a small notebook for this (07_remove_duplicates.ipynb) and it seems to work very well: it found ~500/24500 images were duplicates in a few minutes, and very few of the flagged samples weren't actual duplicates (based on a series of quick random plots)
  • Could integrate this workflow to run over all the images every so often (or whenever new data is added to the dataset).
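
One possible shape for the "run whenever new data is added" idea, as a sketch only. It does not use imagededup: it keeps a JSON manifest of content hashes from earlier runs and flags any unrecorded file whose bytes match a recorded one. The manifest format, the function name `check_new_images`, and the choice of exact SHA-256 hashing (rather than the perceptual hashing imagededup offers) are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def check_new_images(image_dir, manifest_path):
    """Compare files in image_dir against hashes recorded on earlier runs.

    Returns (new_unique, duplicates) as lists of paths and rewrites the
    manifest so that newly seen unique files are remembered next time.
    """
    image_dir = Path(image_dir)
    manifest_path = Path(manifest_path)
    # Manifest maps digest -> relative path of the first file with that content.
    seen = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    known_paths = set(seen.values())
    new_unique, duplicates = [], []
    for path in sorted(image_dir.rglob("*")):
        if not path.is_file() or path.resolve() == manifest_path.resolve():
            continue
        rel = path.relative_to(image_dir).as_posix()
        if rel in known_paths:
            continue  # already recorded as unique on a previous run
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)  # same bytes as a recorded file
        else:
            seen[digest] = rel
            new_unique.append(path)
    manifest_path.write_text(json.dumps(seen, indent=2))
    return new_unique, duplicates
```

Duplicates are never written to the manifest, so they keep being reported on every run until they are deleted.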

mrdbourke avatar Jan 24 '23 03:01 mrdbourke

Next step will be to turn the notebook version of this into a script.

mrdbourke avatar Jan 30 '23 05:01 mrdbourke