multi_object_datasets
Number of objects in CLEVRwithmasks scenes
All fields seem to be padded with zeros up to 11 objects. How can I find out the true number of objects?
What are the valid integer codes for color and material? Are they all non-zero? Colors and materials are encoded as uint8 in CLEVRwithmasks but as strings in CLEVR.
An example of the dump that I get:
'color': [0, 1, 2, 3, 1, 1, 4, 5, 0, 0, 0],
'material': [0, 1, 1, 2, 2, 1, 2, 2, 0, 0, 0],
'shape': [0, 1, 2, 1, 1, 3, 3, 3, 0, 0, 0],
'size': [0, 1, 1, 2, 2, 2, 2, 1, 0, 0, 0],
'visibility': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
'pixel_coords': [[0.0, 0.0, 0.0], [216.0, 92.0, 11.397212982177734], [184.0, 127.0, 9.41761589050293], [116.0, 81.0, 13.153035163879395], [51.0, 121.0, 10.44654655456543], [123.0, 129.0, 10.018261909484863], [36.0, 109.0, 11.129423141479492], [160.0, 176.0, 7.559253692626953], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
'rotation': [0.0, 206.87570190429688, 158.63943481445312, 330.7286071777344, 31.30453109741211, 198.59092712402344, 2.6792359352111816, 243.39840698242188, 0.0, 0.0, 0.0],
'x': [0.0, 0.7548010349273682, 1.6617735624313354, -2.911912679672241, -1.656102180480957, 0.1737128645181656, -2.6883020401000977, 2.8262786865234375, 0.0, 0.0, 0.0],
'y': [0.0, 1.7722225189208984, -0.6132868528366089, 0.35031434893608093, -2.893202543258667, -1.555863618850708, -2.886444568634033, -2.4974899291992188, 0.0, 0.0, 0.0],
'z': [0.0, 0.699999988079071, 0.699999988079071, 0.3499999940395355, 0.3499999940395355, 0.3499999940395355, 0.3499999940395355, 0.699999988079071, 0.0, 0.0, 0.0]
As you can see, the first object seems to have zeros everywhere, but visibility is 1.0 :/
Hi Vadim,
The first object has all-zero attributes (color, material, shape, and size) because it represents the background. As you may have observed, the first segmentation mask (for any scene) contains the background pixels.
The mapping from integers to words for CLEVR features is as follows:
"material": {"metal": 2, "rubber": 1},
"size": {"large": 1, "small": 2},
"color": {"cyan": 2, "red": 1, "brown": 5, "gray": 6, "purple": 7, "yellow": 8, "blue": 4, "green": 3},
"shape": {"cube": 3, "sphere": 1, "cylinder": 2}
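For convenience, here is a minimal sketch of how the integer codes could be turned back into names. The tables are copied from the mapping above; the names `NAME_TO_CODE`, `CODE_TO_NAME` and `decode` are made up for this example and are not part of the multi_object_datasets library. Code 0 does not appear in the mapping, so the sketch assumes it marks the background or a padded slot and returns `None` for it.

```python
# Name-to-code tables copied from the mapping given in this thread.
NAME_TO_CODE = {
    "material": {"metal": 2, "rubber": 1},
    "size": {"large": 1, "small": 2},
    "color": {"cyan": 2, "red": 1, "brown": 5, "gray": 6,
              "purple": 7, "yellow": 8, "blue": 4, "green": 3},
    "shape": {"cube": 3, "sphere": 1, "cylinder": 2},
}

# Invert each table so that integer codes can be looked up directly.
CODE_TO_NAME = {
    feature: {code: name for name, code in table.items()}
    for feature, table in NAME_TO_CODE.items()
}


def decode(feature, codes, unknown=None):
    """Map a list of integer codes to names for one feature.

    Codes missing from the table (assumed here: 0 for background or
    padding) are returned as `unknown`.
    """
    table = CODE_TO_NAME[feature]
    return [table.get(int(code), unknown) for code in codes]


# The 'color' vector from the dump earlier in this thread.
colors = decode("color", [0, 1, 2, 3, 1, 1, 4, 5, 0, 0, 0])
```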
And you can find the number of visible objects in any scene using the visibility
vector. Note that it codes both the background and foreground objects as 1.0.
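As a rough sketch of that counting rule: since the background slot is also marked visible, the number of foreground objects should be the number of visible entries minus one. The function name `count_foreground_objects` is hypothetical, and the sketch assumes the background always occupies the first slot with visibility 1.0, as in the dump above.

```python
def count_foreground_objects(visibility):
    """Count visible foreground objects from a per-slot visibility vector.

    Assumption: exactly one visible entry (the first slot) is the
    background, so it is subtracted from the total.
    """
    visible = sum(1 for v in visibility if v > 0.5)
    return max(visible - 1, 0)


# The 'visibility' vector from the dump earlier in this thread.
visibility = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
num_objects = count_foreground_objects(visibility)
```

For the dump above this gives 7, which matches the seven slots with non-zero attributes.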
Hope this helps, Rish
Thanks! It would help to add these to the README!
I've got another question: how were the train/test splits made (for both CLEVR6 and CLEVR10)? Could you provide the file lists?
In Multi-Object Representation Learning with Iterative Variational Inference, we trained our model only on images containing 3-6 visible foreground objects (inclusive range), i.e. CLEVR6. We then assessed the model's generalization to the full dataset (where scenes could contain up to 10 objects).
You can construct the train split from CLEVR (with masks) by writing a filtering function which returns True when sum(visibility) <= 7. The threshold is 7 rather than 6 because the background also counts as visible. Sorry, it won't be possible to provide an exact file list.
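A minimal sketch of such a filtering function, written in plain Python over dictionaries shaped like the dump earlier in this thread. The names `is_clevr6` and `MAX_VISIBLE_CLEVR6` and the toy `scenes` list are invented for illustration; this is not the code used for the paper.

```python
# Background (1) plus at most 6 foreground objects.
MAX_VISIBLE_CLEVR6 = 7


def is_clevr6(example):
    """Return True when the scene has at most 6 visible foreground objects.

    `example` is assumed to be a dict with a 'visibility' entry holding
    one 0.0/1.0 value per slot, background included.
    """
    return sum(example["visibility"]) <= MAX_VISIBLE_CLEVR6


# Toy scenes: 6, 7 and 10 visible foreground objects (plus background).
scenes = [
    {"visibility": [1.0] * 7 + [0.0] * 4},
    {"visibility": [1.0] * 8 + [0.0] * 3},
    {"visibility": [1.0] * 11},
]
clevr6_scenes = [scene for scene in scenes if is_clevr6(scene)]
```

When reading the TFRecords with `tf.data`, an equivalent predicate (for example one built on `tf.reduce_sum`) could presumably be passed to `Dataset.filter`; that variant is not shown here.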
Do I understand correctly that sum(visibility) <= 7 was train and sum(visibility) > 7 was test? Or did test also contain some (or all?) of the sum(visibility) <= 7 scenes?
How many images were in train/test?
Basically, I'm trying to figure out the object discovery evaluation setup for Slot Attention which I think matched your setup (
Thank you!
CLEVR6 := sum(visibility) <= 7, whereas CLEVR10 was the whole dataset (any number of visible objects). That should reflect the terminology in the Slot Attention paper.
Does the test split ensure that it doesn't intersect too much with train? E.g. are train images excluded? Is there any filtering w.r.t. object properties? Do you still have the sizes of train/test somewhere? Maybe in comments inside the arXiv submission? :)
Is it true that:
- the first 70k examples are used for train (further filtered to contain <= 6 objects), and all of them are used in training
- the remaining 30k examples are used for test (further filtered to contain <= 6 or <= 10 objects), and 320 examples are sampled uniformly from the filtered set