pytorch_RVAE

How to create MSCOCO dataset

Open Frankey419 opened this issue 5 years ago • 1 comments

Hi, I downloaded the MSCOCO captions_train2014.json and captions_val2014.json files. As described in the paper, there are 82,783 train samples and 40,504 val samples, and every sample contains 5 captions. If I omit one caption and combine the other four into two paraphrase pairs, I get 2*(82,783 + 40,504) = 246,574 pairs. How can I get the 320k paraphrase pairs?

Frankey419, Sep 08 '19 02:09

The author replied to me and explained how to create the dataset as follows: Each data point has multiple captions. Say a, b and c are paraphrases of each other; to turn them into pairs, you can use the following pairings: a -> b, b -> a, a -> c, c -> a, b -> c, c -> b.

This yields many more data points than the total number of image-caption pairs. However, make sure that all the captions belonging to a single image stay together in either train or val.
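The pairing described above can be sketched in Python roughly as follows. This is only a minimal sketch, not the author's actual script: it assumes the standard COCO captions layout (a top-level "annotations" list whose entries carry "image_id" and "caption"), and the helper names group_captions_by_image, make_paraphrase_pairs and build_splits are made up for illustration. An image with k captions gives k*(k-1) ordered pairs, so 5 captions give 20. Whether any subsampling is needed to arrive at the 320k figure is not stated in this thread.

```python
import json
from collections import defaultdict
from itertools import permutations


def group_captions_by_image(annotations):
    """Map each image_id to the list of captions written for that image."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["image_id"]].append(ann["caption"].strip())
    return grouped


def make_paraphrase_pairs(annotations):
    """Return every ordered (source, target) pair of captions that share an image.

    For captions a, b, c this gives a->b, a->c, b->a, b->c, c->a, c->b.
    """
    pairs = []
    for captions in group_captions_by_image(annotations).values():
        pairs.extend(permutations(captions, 2))
    return pairs


def build_splits(train_path, val_path):
    """Build pairs for each file separately (hypothetical helper).

    Pairing within one file at a time keeps all captions of an image
    in the same split, as the author advises.
    """
    splits = {}
    for name, path in (("train", train_path), ("val", val_path)):
        with open(path, encoding="utf-8") as f:
            annotations = json.load(f)["annotations"]  # assumed COCO layout
        splits[name] = make_paraphrase_pairs(annotations)
    return splits


# Usage, assuming both files are in the working directory:
# splits = build_splits("captions_train2014.json", "captions_val2014.json")
# print(len(splits["train"]), len(splits["val"]))
```

Because pairs are only formed among captions of the same image, and train and val come from separate files, no image contributes captions to both splits.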

jackyuanjie1990, Jan 30 '21 22:01