
Allow subsets and recursive search in ImageNet

Open zhiltsov-max opened this issue 3 years ago • 2 comments

An ImageNet dataset can have any of the following layouts:

# Single subset, no relative paths
/n234543/image1.jpg
/n343523/image2.jpg

# Multiple subsets, no relative paths
/train/n345345/image1.jpg
/test/n456456/image2.jpg

# Single subset, relative paths
/n234543/a/b/c/image1.jpg
/n343523/d/e/f/image2.jpg

# Multiple subsets, relative paths
/train/n345345/a/b/c/image1.jpg
/test/n456456/d/e/f/image2.jpg

Optionally, image file names can have a prefix with the label name.

It is quite easy to see that there is no practical way to distinguish between the different layouts (2 and 3, 3 and 4, etc.) from the directory structure alone.

However, if there is a file that describes the dataset labels (e.g. synsets.txt with key - label name pairs, or just label names), the situation is as follows:

# n234543 cat
# n343523 dog
# ...
/synsets.txt

# Single subset, no relative paths
/n234543/image1.jpg
/n343523/image2.jpg
# ^^^^^ 
# 1. Either a subset or a label name
# 2. In the list of label names, so it's a label

# Multiple subsets, no relative paths
/train/n345345/image1.jpg
/test/n456456/image2.jpg
# ^^
# 1. Either a subset or a label name
# 2. Not in the list of label names, so it's a subset
# 3. The next component is a label name, since it is in the list of labels (*1)

# Single subset, relative paths
/n234543/a/b/c/image1.jpg
/n343523/d/e/f/image2.jpg
# ^^^^^
# The same as the 1st case

# Multiple subsets, relative paths
/train/n345345/a/b/c/image1.jpg
/test/n456456/d/e/f/image2.jpg
# ^^^^^
# The same as the 2nd case
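
The rules annotated above can be sketched roughly as follows. This is only an illustration of the idea, not Datumaro code: `load_labels`, `split_image_path` and `DEFAULT_SUBSET` are hypothetical names, and the sketch assumes that the first whitespace-separated token on each line of synsets.txt is the label directory name.

```python
from pathlib import PurePosixPath

DEFAULT_SUBSET = "default"  # hypothetical name used for the single-subset layouts


def load_labels(text):
    """Parse synsets.txt content: 'key name' pairs or bare label names."""
    labels = {}
    for line in text.splitlines():
        key, _, name = line.strip().partition(" ")
        if key:
            # A bare label name acts as its own key.
            labels[key] = name.strip() or key
    return labels


def split_image_path(path, labels):
    """Split a dataset-relative path into (subset, label, relative image path)."""
    parts = PurePosixPath(path).parts
    if parts and parts[0] == "/":
        parts = parts[1:]
    if len(parts) >= 2 and parts[0] in labels:
        # First component is a known label: single-subset layout.
        subset, label, rest = DEFAULT_SUBSET, parts[0], parts[1:]
    elif len(parts) >= 3 and parts[1] in labels:
        # First component is not a label, the next one is: it is a subset.
        subset, label, rest = parts[0], parts[1], parts[2:]
    else:
        raise ValueError(f"cannot find a known label in {path!r}")
    # Whatever follows the label is the (possibly nested) image path.
    return subset, label, "/".join(rest)
```

With the labels n234543 and n345345 listed in the file, `/n234543/a/b/c/image1.jpg` resolves to the default subset, while `/train/n345345/a/b/c/image1.jpg` resolves to the subset train. A subset directory that happens to be named like a label would still be ambiguous under this scheme.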

*1 - In general, Datumaro searches for a dataset recursively (BTW, maybe we should allow this only for detection and drop it for loading). So, consider the following situations:

/somedir/n234543/image1.jpg
/somedir/n343523/image2.jpg

/somedir/train/n234543/image1.jpg
/somedir/test/n343523/image2.jpg

In both cases we can't distinguish whether it is already a dataset or just a directory. But if there is a labels file:

/somedir/synsets.txt
/somedir/n234543/image1.jpg
/somedir/n343523/image2.jpg

/somedir/synsets.txt
/somedir/train/n234543/image1.jpg
/somedir/test/n343523/image2.jpg

The cases are clearly distinguishable.
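
A recursive detection step built on this could look roughly like the sketch below. It assumes that the presence of the labels file marks a dataset root; `find_dataset_roots` is a hypothetical helper, not an existing Datumaro function.

```python
import os


def find_dataset_roots(search_root, labels_file="synsets.txt"):
    """Return the directories under search_root that contain a labels file."""
    roots = []
    for dirpath, dirnames, filenames in os.walk(search_root):
        if labels_file in filenames:
            roots.append(dirpath)
            # A dataset root was found: do not descend into its
            # subset and label directories.
            dirnames.clear()
    return sorted(roots)
```

Clearing `dirnames` in place is what stops `os.walk` from descending further, so label and subset directories inside a detected dataset are never reported as separate datasets.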

The remaining question is: what to do when there is no such file? Well, let's use the default list from the original dataset. Therefore, every ImageNet-like dataset with non-standard labels will have to include an extra file. Such a list is already included in the original format.
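
The proposed fallback could be sketched as follows. `resolve_labels` and `DEFAULT_LABELS` are hypothetical names, and the two default entries are placeholders taken from the example listing, standing in for the real label list of the original dataset.

```python
from pathlib import Path

# Placeholder for the label list shipped with the original dataset;
# these two entries are illustrative only, not real synsets.
DEFAULT_LABELS = {"n234543": "cat", "n343523": "dog"}


def resolve_labels(dataset_dir, labels_file="synsets.txt", defaults=DEFAULT_LABELS):
    """Read labels from the dataset's own file, or fall back to the defaults."""
    path = Path(dataset_dir) / labels_file
    if not path.is_file():
        # No labels file: assume the standard label list.
        return dict(defaults)
    labels = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        key, _, name = line.strip().partition(" ")
        if key:
            labels[key] = name.strip() or key
    return labels
```

Under this scheme, a dataset with custom labels only needs to carry its own labels file; everything else is interpreted against the default list.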

zhiltsov-max, Oct 07 '21 16:10

@IRDonch, @yasakova-anastasia, what about AC?

zhiltsov-max, Oct 07 '21 17:10

I don't think AC supports this format at all.

IRDonch, Oct 07 '21 17:10