fastdup
fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. Assisting you to increase your dataset images & labels quality and reduce your data oper...
FastDup | A tool for gaining insights from a large image collection
Large Image Datasets Today are a Mess | Blog Post | Video Tutorial
FastDup is a tool for gaining insights from a large image collection. It can find anomalies, duplicate and near-duplicate images, and clusters of similar images, and it can learn the normal behavior of a dataset and the temporal interactions between images. It can be used for smart subsampling to obtain a higher-quality dataset, for outlier removal, and for novelty detection of new information to be sent for tagging. FastDup scales to millions of images running on CPU only.
From the authors of GraphLab and Turi Create.
Compute Image Statistics
Compute image statistics and visualize the results, using food-101 dataset
Identify duplicates
Duplicates and near duplicates identified in MS-COCO and Imagenet-21K dataset
Find corrupted and broken images
Thousands of broken ImageNet images that have confusing labels of real objects.
Find outliers
IMDB-WIKI outliers (data goal is for face recognition, gender and age classification)
Find similar persons
Can you tell how many different persons?
Find wrong labels
Wrong labels in the Imagenet-21K dataset.
Find confusing labels report
Identify wrong / confusing labels using k-nearest neighbor visual classifier
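The idea behind a k-nearest-neighbor label check can be sketched in a few lines of plain Python. This is an illustrative sketch only, not fastdup's implementation: the function names, the toy 2-D "embeddings" and the choice of cosine similarity are all assumptions made for the example. A sample is flagged when its label disagrees with the majority label of its k most similar neighbors.

```python
from collections import Counter
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flag_suspect_labels(features, labels, k=3):
    """Return indices whose label differs from the majority label of their k nearest neighbors."""
    suspects = []
    for i, query in enumerate(features):
        # Rank every other sample by similarity to the query, most similar first.
        neighbors = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: cosine_similarity(query, features[j]),
            reverse=True,
        )[:k]
        majority_label, _ = Counter(labels[j] for j in neighbors).most_common(1)[0]
        if majority_label != labels[i]:
            suspects.append(i)
    return suspects


# Toy data (hypothetical): index 3 sits among the "cat" vectors but is labeled "dog".
features = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.95, 0.15),
            (0.1, 1.0), (0.0, 0.9), (0.2, 1.0)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog", "dog"]
print(flag_suspect_labels(features, labels, k=3))
```

In practice the feature vectors would come from an image embedding model, and the brute-force neighbor search would be replaced by an approximate nearest-neighbor index.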
Find images with contradicting labels
Cluster of wrong labels in the Imagenet-21K dataset. No human can tell those red wines apart from their images.
Fun labels in the Imagenet-21K dataset
Coming soon: image graph search (please reach out if you would like to beta test)
Upcoming new features: image graph search!
Results on Key Datasets (full results here)
We have thoroughly tested fastdup across various well-known visual datasets, ranging from pillar academic datasets to Kaggle competitions. A key finding we have made using FastDup is that there are ~1.2M (!) duplicate images in the ImageNet-21K dataset, out of which 104K pairs belong both to the train and to the val splits (this amounts to 20% of the validation set). This is a new, previously unknown result! Full results are below.
* train/val splits are taken from https://github.com/Alibaba-MIIL/ImageNet21 .
Dataset | Total Images | cost [$] | spot cost [$] | processing [sec] | Identical pairs | Anomalies |
---|---|---|---|---|---|---|
imagenet21k-resized | 11,582,724 | 4.98 | 1.24 | 11,561 | 1,194,059 | Anomalies Wrong Labels |
imdb-wiki | 514,883 | 0.65 | 0.16 | 1,509 | 187,965 | View |
places365-standard | 2,168,460 | 1.01 | 0.25 | 2,349 | 93,109 | View |
herbarium-2022-fgvc9 | 1,050,179 | 0.69 | 0.17 | 1,598 | 33,115 | View |
landmark-recognition-2021 | 1,590,815 | 0.96 | 0.24 | 2,236 | 2,613 | View |
visualgenome | 108,079 | 0.05 | 0.01 | 124 | 223 | View |
iwildcam2021-fgvc9 | 261,428 | 0.29 | 0.07 | 682 | 54 | View |
coco | 163,957 | 0.09 | 0.02 | 218 | 54 | View |
sku110k | 11,743 | 0.03 | 0.01 | 77 | 7 | View |
- The experiments presented were run on a 32-core Google Cloud machine with 128GB RAM (no GPU required).
- All experiments can also be reproduced on an 8-core, 32GB machine (excluding Imagenet-21K).
- We ran on the full ImageNet-21K dataset (11.5M images), comparing all pairs of images in roughly 3 hours (11,561 seconds in the table above) WITHOUT a GPU, at a Google Cloud cost of about $5.
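To put the table in perspective, the throughput and the cost per million images can be derived directly from its columns. The short sketch below does that arithmetic for three of the rows; the numbers are copied from the table, while the helper names are made up for this example.

```python
# (dataset, total images, on-demand cost in USD, processing time in seconds),
# copied from the results table above.
RESULTS = [
    ("imagenet21k-resized", 11_582_724, 4.98, 11_561),
    ("places365-standard", 2_168_460, 1.01, 2_349),
    ("coco", 163_957, 0.09, 218),
]


def throughput(images, seconds):
    """Images processed per second."""
    return images / seconds


def cost_per_million(images, cost_usd):
    """On-demand cost in USD, normalized to one million images."""
    return cost_usd / images * 1_000_000


for name, images, cost, seconds in RESULTS:
    print(
        f"{name}: {throughput(images, seconds):,.0f} images/sec, "
        f"${cost_per_million(images, cost):.2f} per million images"
    )
```

For the ImageNet-21K row this works out to roughly a thousand images per second on the 32-core machine.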
Quick Installation
For Python 3.7, 3.8, 3.9 (Ubuntu 20.04 or Ubuntu 18.04 or Debian 10 or Mac M1 or Mac Intel Mojave and up)
# upgrade pip to its latest version
python3.XX -m pip install -U pip
# install fastdup
python3.XX -m pip install fastdup
Where XX is your Python version. For CentOS 7.X, RedHat 4.8 and other older Linux distributions, see our installation instructions.
Running the code
import fastdup
fastdup.run(input_dir="/path/to/your/folder", work_dir='out', nearest_neighbors_k=5, turi_param='ccthreshold=0.96') #main running function.
fastdup.create_duplicates_gallery('out/similarity.csv', save_path='.') #create a visual gallery of found duplicates
fastdup.create_outliers_gallery('out/outliers.csv', save_path='.') #create a visual gallery of anomalies
fastdup.create_components_gallery('out', save_path='.') #create visualization of connected components
fastdup.create_stats_gallery('out', save_path='.', metric='blur') #create visualization of image statistics (for example blur)
fastdup.create_similarity_gallery('out', save_path='.',get_label_func=lambda x: x.split('/')[-2]) #create visualization of top_k similar images assuming data have labels which are in the folder name
fastdup.create_aspect_ratio_gallery('out', save_path='.') #create aspect ratio gallery
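The `get_label_func` lambda passed to `create_similarity_gallery` above takes the label from the name of the folder that contains each image. A minimal sketch of the same idea with `pathlib` follows; the helper name and the example path are hypothetical, and the sketch assumes POSIX-style paths laid out as `<root>/<label>/<image>`.

```python
from pathlib import PurePosixPath


def label_from_path(path):
    """Return the name of the folder that directly contains the image."""
    return PurePosixPath(path).parent.name


# Hypothetical Food-101-style path: <root>/<label>/<image>
example = "/data/food-101/images/apple_pie/134.jpg"

# Both forms agree when the path uses '/' separators.
assert label_from_path(example) == example.split('/')[-2] == "apple_pie"
print(label_from_path(example))
```

A named function like this can be passed as `get_label_func=label_from_path` in place of the lambda, which makes it easier to adapt when labels live elsewhere (for example in a CSV file).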
Working on the Food-101 dataset: detecting identical pairs, similar pairs (search) and outliers (non-food images).
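The `ccthreshold=0.96` value in the `fastdup.run` call above relates to connected components. Assuming it acts as a similarity cutoff for linking images into components (our reading of the parameter, not a description of fastdup's internals), the grouping step can be sketched with a small union-find. The function name, the `(image_a, image_b, similarity)` tuple layout and the file names are assumptions made for illustration.

```python
def connected_components(pairs, threshold=0.96):
    """Group items linked by pairs whose similarity is at least `threshold`.

    `pairs` is an iterable of (item_a, item_b, similarity) tuples.
    """
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        root_a, root_b = find(a), find(b)
        if similarity >= threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical similarity pairs between five images.
pairs = [
    ("a.jpg", "b.jpg", 0.99),
    ("b.jpg", "c.jpg", 0.97),
    ("c.jpg", "d.jpg", 0.80),  # below the threshold: does not link c and d
    ("d.jpg", "e.jpg", 0.96),
]
print(connected_components(pairs, threshold=0.96))
```

Raising the threshold splits the data into more, tighter groups; lowering it merges groups through chains of moderately similar images.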
Getting started examples
- 🔥 Finding duplicates, outliers and connected components in the Food-101 dataset, including Tensorboard Projector visualization - Google Colab
- 🔥🔥 Visualizing and understanding a new dataset, looking at data outliers and label outliers, training a baseline KNN classifier and getting to an accuracy of 0.99 by removing confusing labels
- Finding wrong labels via image similarity
- Computing image statistics
- Using your own ONNX model for extraction
- Getting started on a Kaggle dataset
- Finding duplicates and outliers in the Food-101 dataset:
- Analyzing video of the MEVA dataset - Google Colab
- Kaggle notebook - visualizing the pistachio dataset
Detailed instructions
- Detailed instructions, install from stable release and installation issues
- Detailed running instructions
User community contributions
- FastDup-based Anime Search Engine by Dorothy Walker
Support
Technology
We build upon several excellent open source tools: Microsoft's ONNX Runtime, Facebook's Faiss, OpenCV, Pillow Resize, Apple's Turi Create, Minio, Amazon's awscli, TensorBoard and scikit-learn.