fairseq-image-captioning
About tokenizing test-split captions on own dataset
In `preprocess/tokenize_captions.py`:

```python
def load_annotations(coco_dir):
    with open(os.path.join(coco_dir, 'annotations', f'captions_train2014.json')) as f:
        annotations = json.load(f)['annotations']
    with open(os.path.join(coco_dir, 'annotations', f'captions_val2014.json')) as f:
        annotations.extend(json.load(f)['annotations'])
    return annotations
```
It seems that this code does not load test2014. To avoid this problem, could you modify the code to:
```python
def load_annotations(coco_dir):
    with open(os.path.join(coco_dir, 'annotations', f'captions_train2014.json')) as f:
        annotations = json.load(f)['annotations']
    with open(os.path.join(coco_dir, 'annotations', f'captions_val2014.json')) as f:
        annotations.extend(json.load(f)['annotations'])
    with open(os.path.join(coco_dir, 'annotations', f'captions_test2014.json')) as f:
        annotations.extend(json.load(f)['annotations'])
    return annotations
```
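As a hedged alternative (not code from the repo), the function could load whichever annotation files are actually present, so it works both with and without a `captions_test2014.json` file. This is only a sketch; the directory layout is assumed to match the standard MS-COCO `annotations/` structure used above:

```python
import json
import os


def load_annotations(coco_dir):
    """Load caption annotations from every COCO split file that exists.

    Hypothetical variant of the repo's load_annotations: annotation
    files are only read when they are actually present on disk, so a
    missing captions_test2014.json does not raise an error.
    """
    annotations = []
    for split in ('train2014', 'val2014', 'test2014'):
        path = os.path.join(coco_dir, 'annotations', f'captions_{split}.json')
        if os.path.exists(path):
            with open(path) as f:
                annotations.extend(json.load(f)['annotations'])
    return annotations
```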
It took me two days to figure this out; until then I kept suspecting the problem was in my own dataset. :C
This project uses the Karpathy splits for train, validation, and test partitioning. The Karpathy splits are defined over the MS-COCO train and validation sets only, which is why test2014 is not loaded. Why do you need the MS-COCO test data?
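For context, the Karpathy splits are commonly distributed as a single JSON file (often `dataset_coco.json`) whose `images` list tags each image from `train2014` or `val2014` with a `split` label. The field names below are assumptions based on that commonly distributed file, not on this repo's code; this is a minimal sketch of partitioning such a list:

```python
from collections import defaultdict


def partition_karpathy(karpathy_images):
    """Group images by their Karpathy 'split' label.

    karpathy_images: list of dicts shaped like the entries in the
    'images' list of dataset_coco.json, each with a 'split' field
    ('train', 'val', 'test', or 'restval') and a 'filepath' field
    ('train2014' or 'val2014').
    """
    splits = defaultdict(list)
    for img in karpathy_images:
        # 'restval' images are conventionally folded into training data
        label = 'train' if img['split'] == 'restval' else img['split']
        splits[label].append(img)
    return splits


# Toy example: note every image comes from train2014 or val2014 only,
# so no captions_test2014.json is ever needed.
images = [
    {'filepath': 'train2014', 'split': 'train'},
    {'filepath': 'val2014', 'split': 'restval'},
    {'filepath': 'val2014', 'split': 'val'},
    {'filepath': 'val2014', 'split': 'test'},
]
parts = partition_karpathy(images)
```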