RxR
The number of instructions seems inconsistent with the paper
Hi, I am trying to train my own model on RxR, but after downloading the data (Guide data only), I found that the number of instructions in the provided file, rxr_train_guide.jsonl, does not seem to match what the paper says. Specifically, the paper says there are 11,089 paths in the training set, but the file contains only 8,000+ unique path_ids (I only considered the English data). Also, Table 5 of the paper says there are 42K training pairs in total in the Guide data for each language, but I got 26k from the file. I am confused about which numbers are correct. Please help. Thanks in advance.
Hey,
There should be ~11089 paths in the training set when all three languages are included. If you are only looking at English, the number of paths is only 8824. While most paths are annotated in all three languages, a subset of paths is annotated in only one language to give more variation in paths. See the last paragraph of Section 4, subsection 'Guide Task' in the paper.
The total number of training pairs in the Guide data is ~79467, which is the 'train' split of the 126k instructions detailed in Table 2 (the rest are in the val-seen, val-unseen, test-standard and test-challenge splits). This is about 26k per language, so your numbers sound right.
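If you want to double-check these counts yourself, here is a rough sketch of how the per-language numbers could be tallied from the JSONL file. It assumes each line is a JSON object with a `path_id` field and a `language` field whose value looks like `en-US` or `hi-IN`; those field names and the language-code format are assumptions on my part, so adjust them if your copy of the data differs. The helper name `count_by_language` is just made up for illustration.

```python
import json
from collections import defaultdict


def count_by_language(lines):
    """Return {language: (unique_paths, instruction_pairs)} for JSONL lines.

    Assumes each record has "path_id" and "language" keys (an assumption,
    not confirmed by this thread). Regional variants such as "en-US" and
    "en-IN" are merged under their shared prefix ("en").
    """
    paths = defaultdict(set)   # language -> set of unique path_ids
    pairs = defaultdict(int)   # language -> number of instruction records
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        lang = record["language"].split("-")[0]
        paths[lang].add(record["path_id"])
        pairs[lang] += 1
    return {lang: (len(paths[lang]), pairs[lang]) for lang in paths}
```

You could call it with an open file handle, e.g. `with open("rxr_train_guide.jsonl") as f: print(count_by_language(f))`. For reference, 79467 / 3 is roughly 26.5k, which is where the ~26k per language comes from.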
What about the 42k training pairs in Table 5? Should it actually be 26k?