nutrify
Clean/remove duplicate images with `fastdup`
Make a script to clean and remove duplicate images with fastdup - https://github.com/visual-layer/fastdup
- fastdup should scale well to this dataset: its authors report a test across ImageNet21k (millions of images) that finished in ~3 hours
- Could run this script periodically to clean images whenever new images are downloaded
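
A cleanup script needs to decide which files to delete once a tool has reported duplicate pairs. Below is a minimal sketch of that step only, assuming the duplicate detector's output has already been converted into `(path_a, path_b)` pairs. The function name `files_to_remove` and the pair format are hypothetical and not part of fastdup's API. It groups linked duplicates together and keeps one file per group.

```python
from collections import defaultdict


def files_to_remove(duplicate_pairs):
    """Keep one file per duplicate group and return the rest, sorted.

    `duplicate_pairs` is an iterable of (path_a, path_b) tuples
    (an assumed format, not the native output of any specific library).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two groups of every reported pair, so that
    # a~b and b~c end up in the same group {a, b, c}.
    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for path in list(parent):
        groups[find(path)].append(path)

    to_remove = []
    for members in groups.values():
        # Keep the first path in sorted order, drop the others.
        _keep, *extras = sorted(members)
        to_remove.extend(extras)
    return sorted(to_remove)
```

Grouping matters because pairwise output can chain: if `a` duplicates `b` and `b` duplicates `c`, deleting one file per pair could remove all three, whereas grouping keeps exactly one.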
Did this with a notebook and removed 695/25000 (or thereabouts) images. Saw a slight reduction in performance, but this was expected since there is now less data leakage between the train & test sets. See the evaluation run: https://wandb.ai/mrdbourke/test_wandb_artifacts_by_reference/runs/714m0crl
Original notes (from #50):
- Found a library that uses hashing to help detect duplicate images: https://github.com/idealo/imagededup. Removing duplicates will help make the model more robust and prevent data from leaking from the train → test set (which would give false metrics)
- Created a small notebook for this (07_remove_duplicates.ipynb) and it seems to work very well: it found ~500/24500 images were duplicates in a few minutes, and a series of quick random plots showed that few of the flagged samples weren't actually duplicates
- Could integrate this workflow to run over all the images every so often (or whenever new data is added to the dataset).
Next step: turn the notebook version of this into a script
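
To illustrate the train → test leakage check mentioned in the notes, here is a minimal standard-library sketch. It swaps in an exact SHA-256 hash of the file bytes, which is not the perceptual hashing that imagededup provides, so it only catches byte-identical copies and misses resized or re-encoded near-duplicates. The function names and the two-directory layout are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_cross_split_duplicates(train_dir, test_dir):
    """Return (train_path, test_path) pairs whose files are byte-identical.

    Assumes train and test images live in two separate directory trees.
    """
    # Index every train file by its content digest.
    train_by_digest = {}
    for p in sorted(Path(train_dir).rglob("*")):
        if p.is_file():
            train_by_digest.setdefault(file_digest(p), p)

    # Any test file whose digest is already indexed has leaked from train.
    leaks = []
    for p in sorted(Path(test_dir).rglob("*")):
        if p.is_file():
            match = train_by_digest.get(file_digest(p))
            if match is not None:
                leaks.append((match, p))
    return leaks
```

An empty result from this check only rules out exact copies across the splits. Near-duplicates still need a similarity-based tool such as the ones linked above.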