cinic-10

Duplicated and invalid images

Open zhanyuanucb opened this issue 4 years ago • 0 comments

Hi! I love your dataset and I think it is very helpful. However, I found quite a few invalid and duplicated images. The invalid images I found are all in the bird and frog classes, and they look like this: (image: invalid)

Here are my results:

Class: airplane
Number of images: 9000
Number of duplicates in class airplane: 121
Class: truck
Number of images: 9000
Number of duplicates in class truck: 26
Class: bird
Number of images: 9000
Number of duplicates in class bird: 24
Class: automobile
Number of images: 9000
Number of duplicates in class automobile: 12
Class: horse
Number of images: 9000
Number of duplicates in class horse: 80
Class: cat
Number of images: 9000
Number of duplicates in class cat: 27
Class: deer
Number of images: 9000
Number of duplicates in class deer: 139
Class: frog
Number of images: 9000
Number of duplicates in class frog: 319
Class: ship
Number of images: 9000
Number of duplicates in class ship: 22
Class: dog
Number of images: 9000
Number of duplicates in class dog: 25
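
For a sense of scale, the per-class counts above can be totalled. This is only a minimal sketch: the numbers are copied by hand from the output above, and the names `duplicates_per_class` and `images_per_class` are illustrative, not part of the original script.

```python
# Per-class duplicate counts, copied from the output above.
duplicates_per_class = {
    "airplane": 121, "truck": 26, "bird": 24, "automobile": 12, "horse": 80,
    "cat": 27, "deer": 139, "frog": 319, "ship": 22, "dog": 25,
}
images_per_class = 9000  # reported for every class in the train split

total_duplicates = sum(duplicates_per_class.values())
total_images = images_per_class * len(duplicates_per_class)
worst_class = max(duplicates_per_class, key=duplicates_per_class.get)

print(f"Total duplicates: {total_duplicates} of {total_images} "
      f"({total_duplicates / total_images:.2%})")
print(f"Most affected class: {worst_class} ({duplicates_per_class[worst_class]})")
```

That comes to 795 flagged files out of 90,000 training images, a little under 1%, with frog the most affected class.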

Here is my code:

import hashlib
import os
import os.path as osp


# Reference: https://medium.com/@urvisoni/removing-duplicate-images-through-python-23c5fdc7479e
def file_hash(filepath):
    with open(filepath, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

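duplicates = []      # (classname, filepath) for every file whose hash was already seen
num_duplicate = 0    # running total, used to report a per-class count
hash_keys = set()    # shared across all classes, so cross-class duplicates are caught too
root = "/data/CINIC10/train"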
for classname in os.listdir(root):
    print(f"Class: {classname}")
    class_dir = osp.join(root, classname)
    class_list = os.listdir(class_dir)
    print(f"Number of images: {len(class_list)}")
    for index, filename in enumerate(class_list):
        filename = osp.join(class_dir, filename)
        if os.path.isfile(filename):
            filehash = file_hash(filename)
            if filehash not in hash_keys:
                hash_keys.add(filehash)
            else:
                duplicates.append((classname, filename))
        else:
            # Skip subdirectories and other non-file entries instead of abandoning the class.
            print(f"{filename} not a file")
    print(f"Number of duplicates in class {classname}: {len(duplicates) - num_duplicate}")
    num_duplicate = len(duplicates)
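
The script above only counts files whose hash was already seen, and the hash set is shared across classes, so a file duplicated across two classes is attributed to whichever class is listed later. To see which files are copies of each other, the paths can be grouped by hash instead. The following is a rough sketch of that idea, not a tested tool for this dataset: `file_md5` and `group_duplicates` are hypothetical helper names, and like the original script it only detects byte-identical files, not re-encoded or resized copies.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(filepath):
    """Return the MD5 hex digest of a file's raw bytes."""
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def group_duplicates(root):
    """Map each hash that occurs more than once to the files sharing it."""
    files_by_hash = defaultdict(list)
    for classname in sorted(os.listdir(root)):
        class_dir = os.path.join(root, classname)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files next to the class folders
        for filename in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, filename)
            if os.path.isfile(path):
                files_by_hash[file_md5(path)].append(path)
    # Keep only hashes shared by at least two files.
    return {h: paths for h, paths in files_by_hash.items() if len(paths) > 1}


if __name__ == "__main__":
    root = "/data/CINIC10/train"  # same path as in the script above
    if os.path.isdir(root):
        for digest, paths in group_duplicates(root).items():
            print(digest, *paths, sep="\n  ")
```

Each printed group lists every file with the same content, which makes it easier to check whether a duplicate sits in the same class or in a different one.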

zhanyuanucb, Jun 23 '20 17:06