ScanRefer
An utterance refers to more than one object
As can be seen below, in scene scene0011_00, which is in the val split, the utterance for one chair is "This is a brown chair. There are many identical chairs setting around the table it sets at."
Obviously, at least 4 chairs match this utterance. Such ambiguous descriptions in the training set may still provide some supervision signal that helps the model learn vision-language alignment, but in the validation set they do not help us evaluate the model's performance: a prediction that selects any of the matching chairs is a reasonable answer, yet only the annotated chair counts as correct.
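One cheap way to look for candidates of this kind is to check whether the same utterance text is attached to more than one object in a scene. The sketch below is only a rough proxy, not a definitive check: it will not catch an utterance that is annotated once but visually matches several objects, as in the chair example above. It assumes each annotation is a dict with the keys `scene_id`, `object_id`, and `description`; those key names are assumptions about the annotation format, the helper `find_shared_utterances` is hypothetical, and the toy records are invented for illustration.

```python
from collections import defaultdict


def find_shared_utterances(annotations):
    """Return {(scene_id, normalized description): sorted object ids} for
    descriptions attached to more than one object in the same scene."""
    targets = defaultdict(set)
    for record in annotations:
        # Normalize case and whitespace so trivially different copies match.
        text = " ".join(record["description"].lower().split())
        targets[(record["scene_id"], text)].add(record["object_id"])
    # Keep only utterances that point at two or more distinct objects.
    return {key: sorted(ids) for key, ids in targets.items() if len(ids) > 1}


# Toy records, invented for illustration (not real ScanRefer annotations).
# The key names are an assumption about the annotation format.
toy = [
    {"scene_id": "scene_a", "object_id": "1", "description": "This is a brown chair."},
    {"scene_id": "scene_a", "object_id": "2", "description": "this is a  brown chair."},
    {"scene_id": "scene_a", "object_id": "3", "description": "This is a white table."},
]

ambiguous = find_shared_utterances(toy)
print(ambiguous)
```

Utterances flagged this way would still need manual inspection before being excluded from an evaluation, since identical wording alone does not prove that the description is ambiguous in the scene.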