datumaro
Allow subsets and recursive search in ImageNet
An ImageNet dataset can have any of the following layouts:
# Single subset, no relative paths
/n234543/image1.jpg
/n343523/image2.jpg
# Multiple subsets, no relative paths
/train/n345345/image1.jpg
/test/n456456/image2.jpg
# Single subset, relative paths
/n234543/a/b/c/image1.jpg
/n343523/d/e/f/image2.jpg
# Multiple subsets, relative paths
/train/n345345/a/b/c/image1.jpg
/test/n456456/d/e/f/image2.jpg
Optionally, image names can be prefixed with the label name.
It is easy to see that there is no practical way to tell some of these layouts apart (2 and 3, 3 and 4, etc.).
However, if there is a file that describes the dataset labels (e.g. synsets.txt
with key - label name pairs, or just label names), the situation becomes the following:
# n234543 cat
# n343523 dog
# ...
/synsets.txt
# Single subset, no relative paths
/n234543/image1.jpg
/n343523/image2.jpg
# ^^^^^
# 1. Either a subset or a label name
# 2. In the list of label names, so it's a label
# Multiple subsets, no relative paths
/train/n345345/image1.jpg
/test/n456456/image2.jpg
# ^^
# 1. Either a subset or a label name
# 2. Not in the list of label names, so it's a subset
# 3. The next component is a label name, in the list of labels (*1)
# Single subset, relative paths
/n234543/a/b/c/image1.jpg
/n343523/d/e/f/image2.jpg
# ^^^^^
# The same as the 1st case
# Multiple subsets, relative paths
/train/n345345/a/b/c/image1.jpg
/test/n456456/d/e/f/image2.jpg
# ^^^^^
# The same as the 2nd case
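The rule in the comments above can be sketched in Python. This is only an illustration, not Datumaro's actual implementation: `parse_labels`, `split_image_path` and the `"default"` subset name are hypothetical, and the sketch assumes each line of the labels file is either a "key name" pair or a bare label name.

```python
from pathlib import PurePosixPath

DEFAULT_SUBSET = "default"  # hypothetical name used when there is no subset directory


def parse_labels(text):
    """Parse labels file contents: one 'key name' pair or one bare label name per line."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, name = line.partition(" ")
        labels[key] = name.strip() or key  # a bare label name maps to itself
    return labels


def split_image_path(rel_path, labels):
    """Split a path relative to the dataset root into (subset, label, item id)."""
    parts = PurePosixPath(rel_path.lstrip("/")).parts
    if len(parts) >= 2 and parts[0] in labels:
        # The first component is a known label: single-subset layout
        subset, label, rest = DEFAULT_SUBSET, parts[0], parts[1:]
    elif len(parts) >= 3 and parts[1] in labels:
        # The first component is not a label, the second one is: multi-subset layout
        subset, label, rest = parts[0], parts[1], parts[2:]
    else:
        raise ValueError(f"cannot find a known label in {rel_path!r}")
    # Everything after the label directory is the relative item path, without extension
    item_id = str(PurePosixPath(*rest).with_suffix(""))
    return subset, label, item_id
```

With labels `n234543`, `n343523` and `n345345` known, `/n234543/image1.jpg` resolves to the default subset, while `/train/n345345/a/b/c/image1.jpg` resolves to the subset `train` with the item path `a/b/c/image1`.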
*1 - In general, Datumaro searches for a dataset recursively (BTW, maybe we should limit this option to detection only and drop it for loading). So, consider the following situations:
/somedir/n234543/image1.jpg
/somedir/n343523/image2.jpg
/somedir/train/n234543/image1.jpg
/somedir/test/n343523/image2.jpg
In both cases we can't tell whether it is already a dataset or just a directory. But if there is a labels file:
/somedir/synsets.txt
/somedir/n234543/image1.jpg
/somedir/n343523/image2.jpg
/somedir/synsets.txt
/somedir/train/n234543/image1.jpg
/somedir/test/n343523/image2.jpg
The cases are clearly distinguishable.
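A minimal sketch of how a recursive search could use the labels file as a marker of the dataset root. `find_dataset_roots` is a hypothetical helper, not an existing Datumaro function, and the file name `synsets.txt` is taken from the example above.

```python
import os


def find_dataset_roots(top, labels_file="synsets.txt"):
    """Return the directories under `top` (inclusive) that contain the labels file."""
    roots = []
    for dirpath, dirnames, filenames in os.walk(top):
        if labels_file in filenames:
            roots.append(dirpath)
            # Everything below a detected root belongs to that dataset,
            # so do not descend into it any further
            dirnames[:] = []
    return sorted(roots)
```

Pruning `dirnames` in place keeps subset directories such as `train` and `test` from being examined as dataset candidates themselves.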
The remaining question is what to do when there is no such file. Well, let's use the default list from the original dataset. Therefore, every ImageNet-like dataset will have to include an extra file when its labels are not standard. Such a list is already included in the original format.
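The proposed fallback could look roughly like this. It is a sketch under assumptions: `load_label_names` is a hypothetical helper, the default list is passed in by the caller, and the first token on each line of the labels file is treated as the label key.

```python
import os


def load_label_names(dataset_root, default_labels, labels_file="synsets.txt"):
    """Read label keys from the labels file, or fall back to the default list."""
    path = os.path.join(dataset_root, labels_file)
    if not os.path.isfile(path):
        # No labels file: assume the standard labels of the original dataset
        return list(default_labels)
    with open(path, encoding="utf-8") as f:
        # The first token on each non-empty line is taken as the label key
        return [line.split()[0] for line in f if line.strip()]
```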
@IRDonch, @yasakova-anastasia, what about AC?
I don't think AC supports this format at all.