dmvr
Inconsistent label mapping among train, test, and validation sets
Hi,
First of all, thanks for sharing this code.
I am using it to implement some Video Vision Transformers with a custom dataset. I wrote my own code for generating the CSV files, following the example you provided for the HMDB dataset. However, I have noticed inconsistent behavior in generate_from_file.py when generating the TFRecord shards:
https://github.com/deepmind/dmvr/blob/77ccedaa084d29239eaeafddb0b2e83843b613a1/examples/generate_from_file.py#L179-L183
If I understand correctly, Python's set removes duplicate values but is also unordered, so you cannot be sure in which order the items will appear. This means that, even with the same labels for training, validation, and testing in my CSV files, the label mapping can differ when generating the TFRecord shards for each subset. I checked this by printing l_map, and these are the mappings I got for each subset (using the same 4 labels in all the subsets and with no modifications to generate_from_file.py):
.../dataset/DMVR_train.csv
{'Label_C': 0, 'Label_A': 1, 'Label_D': 2, 'Label_B': 3}
.../dataset/DMVR_test.csv
{'Label_A': 0, 'Label_C': 1, 'Label_D': 2, 'Label_B': 3}
.../dataset/DMVR_validation.csv
{'Label_B': 0, 'Label_D': 1, 'Label_C': 2, 'Label_A': 3}
In fact, the mapping is different for each subset whenever I rerun this small test. I am not sure if there might be an error in my CSV files (I do not really think so, but maybe) or just a particular issue with my custom dataset.
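A possible workaround, as a minimal sketch rather than the script's actual code: sorting the unique labels before enumerating them makes the mapping deterministic. Python randomizes string hashes per process by default, which would explain why the set order changes between runs. The helper name build_label_map and the label lists below are hypothetical and only illustrate the idea; they are not taken from generate_from_file.py.

```python
def build_label_map(labels):
    """Map each distinct label to an index, assigned in sorted label order."""
    # sorted() gives a fixed order, unlike iterating over the set directly.
    return {label: index for index, label in enumerate(sorted(set(labels)))}


# Hypothetical label columns, standing in for the values read from each CSV.
train_labels = ["Label_C", "Label_A", "Label_D", "Label_B", "Label_A"]
test_labels = ["Label_B", "Label_D", "Label_A", "Label_C"]

train_map = build_label_map(train_labels)
test_map = build_label_map(test_labels)

print(train_map)  # {'Label_A': 0, 'Label_B': 1, 'Label_C': 2, 'Label_D': 3}
print(train_map == test_map)  # True
```

Note that sorting only yields identical mappings when every subset contains exactly the same set of labels. If a subset could be missing a label, it seems safer to build the mapping once from the union of all labels and reuse it for every subset.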