Repetitions in file names and class labels

Open swghosh opened this issue 4 years ago • 1 comments

The CSV file used to download the dataset contains some repeated URL values (possibly intentional, since a single picture may contain several faces), and a few celebrity names share the same class label.

The following entries refer to two different celebrities, yet they have the same class index.

  • Kanchan - nm0437156
  • Ilias_Kanchan - nm0437156
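A minimal sketch of how such label collisions could be listed: group the names by class index and keep the indices used by more than one name. The `name` column is an assumption (only `index` and `image` appear in the sample code in this issue), and the in-memory sample is made up apart from the two names and the index quoted above.

```python
import csv
import io
from collections import defaultdict


def conflicting_indices(rows):
    """Return {class_index: sorted names} for indices shared by several names."""
    names = defaultdict(set)
    for row in rows:
        # 'name' is an assumed column name; adjust it to the real CSV header.
        names[row['index']].add(row['name'])
    return {idx: sorted(n) for idx, n in names.items() if len(n) > 1}


# Small stand-in for IMDb-Face.csv; the third row is a made-up placeholder.
sample = io.StringIO(
    "name,index,image\n"
    "Kanchan,nm0437156,a.jpg\n"
    "Ilias_Kanchan,nm0437156,b.jpg\n"
    "Some_Actor,nm0000001,c.jpg\n"
)
conflicts = conflicting_indices(csv.DictReader(sample))
print(conflicts)
```

To run it against the real file, pass `csv.DictReader(open('IMDb-Face.csv', newline=''))` instead of the sample.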

Apart from that, a few entries in the dataset are exact repetitions: each repeated entry has the same class index, filename, and URL. (This assumes that the format {class_index}_{filename.jpg} should identify a unique entry.)

Hope this helps! Alternatively, please let me know if I was mistaken and these were intentional.

Sample code to reproduce the problem.

import csv

# Build a '{class_index}_{filename}' key for every row of the CSV.
with open('IMDb-Face.csv', 'r', newline='') as file_a:
    spreadsheet = csv.DictReader(file_a)
    entries = ['%s_%s' % (entry['index'], entry['image']) for entry in spreadsheet]
print(len(entries), 'entries were found.')

# Collapsing the keys into a set drops exact repeats.
unique_entries = set(entries)
print(len(unique_entries), 'unique entries were found.')
+ 1662888 entries were found.
- 1632927 unique entries were found.
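The counts above show that repeats exist but not which keys repeat. A hedged sketch of one way to list them is to count each `{index}_{image}` key with `collections.Counter` and keep the keys seen more than once. The sample rows below are made-up placeholders, not real dataset entries.

```python
import csv
import io
from collections import Counter


def duplicated_keys(rows):
    """Return {'{index}_{image}': count} for keys that occur more than once."""
    counts = Counter('%s_%s' % (row['index'], row['image']) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Placeholder rows standing in for IMDb-Face.csv.
sample = io.StringIO(
    "index,image\n"
    "nm0000001,x.jpg\n"
    "nm0000001,x.jpg\n"
    "nm0000002,y.jpg\n"
)
dupes = duplicated_keys(csv.DictReader(sample))
print(dupes)
```

Swapping the sample for the real CSV would give the concrete list of repeated entries, which could then be checked by hand to see whether the repeats are intentional.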

swghosh avatar Aug 20 '19 20:08 swghosh

I downloaded the dataset, and it looks like the "Kanchan" class is junk or an error, while "Ilias_Kanchan" is the real class.

Apich238 avatar Sep 11 '23 13:09 Apich238