Inconsistent label mapping among train, test, and validation sets

Open JTorres258 opened this issue 2 years ago • 0 comments

Hi,

First of all, thanks for sharing this code.

I am using it to implement some Video Vision Transformers with a custom dataset. I wrote my own code for generating the CSV files, following the example you provided for the HMDB dataset. However, I have noticed inconsistent behavior in generate_from_file.py when generating the TFRecord shards:

https://github.com/deepmind/dmvr/blob/77ccedaa084d29239eaeafddb0b2e83843b613a1/examples/generate_from_file.py#L179-L183

If I understand correctly, set removes duplicate values but, at the same time, it is unordered, so you cannot be sure in which order the items will appear. This means that, even with the same labels for training, validation, and testing in my CSV files, the label mapping can differ when generating the TFRecord shards of each subset. I checked this by printing l_map, and these are the mappings I got for each subset (using the same 4 labels in all the subsets and with no modifications to generate_from_file.py):

.../dataset/DMVR_train.csv
{'Label_C': 0, 'Label_A': 1, 'Label_D': 2, 'Label_B': 3}
.../dataset/DMVR_test.csv
{'Label_A': 0, 'Label_C': 1, 'Label_D': 2, 'Label_B': 3}
.../dataset/DMVR_validation.csv
{'Label_B': 0, 'Label_D': 1, 'Label_C': 2, 'Label_A': 3}

In fact, the mapping is different for each subset whenever I rerun this small test. I am not sure if there might be an error in my CSV files (I do not really think so, but maybe) or just a particular issue with my custom dataset.
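
A possible workaround, sketched under the assumption that the mapping in generate_from_file.py is built by enumerating a set of the label column: sorting the unique labels before enumerating them makes the order deterministic. The function name build_label_map and the sample lists below are hypothetical and are not part of DMVR.

```python
from typing import Dict, Iterable


def build_label_map(labels: Iterable[str]) -> Dict[str, int]:
    """Map each distinct label to an index, assigned in sorted label order.

    sorted() removes the dependence on set iteration order, which for
    strings can change from one Python process to the next.
    """
    return {label: index for index, label in enumerate(sorted(set(labels)))}


# Hypothetical label columns, standing in for the three CSV files.
train_labels = ['Label_C', 'Label_A', 'Label_D', 'Label_B', 'Label_A']
test_labels = ['Label_A', 'Label_C', 'Label_D', 'Label_B']
validation_labels = ['Label_B', 'Label_D', 'Label_C', 'Label_A', 'Label_D']

train_map = build_label_map(train_labels)
test_map = build_label_map(test_labels)
validation_map = build_label_map(validation_labels)

print(train_map)
# {'Label_A': 0, 'Label_B': 1, 'Label_C': 2, 'Label_D': 3}
```

This only gives identical mappings when every subset contains exactly the same set of labels. If a subset might be missing a label, one option would be to build the mapping once from the union of all labels and reuse it for each subset.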

JTorres258 · Nov 29 '22 09:11