custom datasets

Open zml110120 opened this issue 2 years ago • 8 comments

How can I customize my own datasets to train the PureT model?

zml110120 avatar May 23 '22 16:05 zml110120

You need to construct JSON files for your own datasets, modeled on the MSCOCO annotation files, and then generate the files needed for training. I have uploaded a new notebook, "ICC分词预处理.ipynb" (ICC word-segmentation preprocessing), for reference. It contains the preprocessing (that is, the generation of the necessary files) for the Image Chinese Captioning (ICC) dataset.
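
A minimal sketch of what "JSON files referring to MSCOCO" could look like for a custom dataset. It is not taken from the notebook. The `images` / `annotations` layout with `id`, `file_name`, `image_id`, and `caption` follows the public MSCOCO captions annotation format; the function name `build_coco_style_annotations` and the sample file names are made up for illustration.

```python
import json


def build_coco_style_annotations(pairs):
    """Turn (file_name, caption) pairs into an MSCOCO-captions-style dict.

    `pairs` is a hypothetical input format; adapt it to however your
    own dataset stores its captions.
    """
    images = []
    annotations = []
    image_ids = {}  # file_name -> integer image id
    for file_name, caption in pairs:
        if file_name not in image_ids:
            image_ids[file_name] = len(image_ids) + 1
            images.append({"id": image_ids[file_name], "file_name": file_name})
        annotations.append({
            "id": len(annotations) + 1,          # unique caption id
            "image_id": image_ids[file_name],    # links caption to image
            "caption": caption,
        })
    return {"images": images, "annotations": annotations}


# Illustrative data only.
example = build_coco_style_annotations([
    ("img_0001.jpg", "a dog runs on the grass"),
    ("img_0001.jpg", "a brown dog is playing outside"),
    ("img_0002.jpg", "two people ride bicycles"),
])
print(json.dumps(example, indent=2))
```

A dict like this can be written out with `json.dump` and then used in place of the MSCOCO annotation file in the preprocessing notebook.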

232525 avatar May 24 '22 05:05 232525

Thank you! I will try it soon.

zml110120 avatar May 24 '22 06:05 zml110120

I'm sorry, I still have some questions. In your "ICC分词输出.ipynb" I can't find anything about "coco_train_input.pkl". Do you have any tools to process COCO Captions (English, not Chinese)? I mean, how do I get all the files under the "mscoco" folder, such as "txt", "misc", and "sent"?

zml110120 avatar May 24 '22 10:05 zml110120

[image]

The core generation logic (how to generate all the necessary files under the mscoco folder) is in the notebook, below the code cells shown in the screenshot. I did not save the preprocessing code for the COCO dataset. In practice, you only need to replace the "ICC_" prefix of all file names with "coco_" (for example, replace `sent_input_file = './ICC_train_input.pkl'` with `sent_input_file = './coco_train_input.pkl'`) and point `raw_train_annotation_file` and `raw_val_annotation_file` at the MSCOCO annotation JSON files. The generation logic is otherwise essentially the same. You can also check the reference GitHub projects listed in the README for more information.
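
A rough sketch of the two changes described above, not the notebook's actual code. `swap_prefix` and `group_captions` are hypothetical helper names; the `annotations` / `image_id` / `caption` keys are those of the public MSCOCO captions annotation files, and the tiny `raw` dict stands in for the loaded JSON.

```python
def swap_prefix(path, old="ICC_", new="coco_"):
    """Rename only the file-name part of a '/'-separated path."""
    head, sep, name = path.rpartition("/")
    if name.startswith(old):
        name = new + name[len(old):]
    return head + sep + name


def group_captions(raw_annotation):
    """Collect captions per image id from an MSCOCO-style annotation dict."""
    grouped = {}
    for ann in raw_annotation["annotations"]:
        grouped.setdefault(ann["image_id"], []).append(ann["caption"].strip())
    return grouped


# Same replacement as in the comment above.
sent_input_file = swap_prefix("./ICC_train_input.pkl")

# Stand-in for json.load(open(raw_train_annotation_file)); illustrative data.
raw = {
    "images": [{"id": 7, "file_name": "a.jpg"}],
    "annotations": [
        {"id": 1, "image_id": 7, "caption": "A cat on a sofa. "},
        {"id": 2, "image_id": 7, "caption": "a cat sleeping"},
    ],
}
captions = group_captions(raw)
```

The grouped captions are the kind of input the later notebook cells would tokenize and write out; the exact formats of the "txt", "misc", and "sent" files are defined in the notebook and are not reproduced here.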

232525 avatar May 24 '22 10:05 232525

I'm sorry for my oversight. There is a "dataset_coco.json" file in the "mscoco" directory, and I would like to know how this file is generated. I haven't started running the following code yet.

[image]

zml110120 avatar May 25 '22 04:05 zml110120

The "dataset_coco.json" file is the Karpathy split annotation file for MSCOCO captioning; it is just a reorganization of the raw MSCOCO JSON annotations. You may want to refer to https://github.com/karpathy/neuraltalk for more details.
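
For orientation, a small sketch of reading captions per split from a Karpathy-style file. The field names used here (`images`, `split`, `cocoid`, `sentences`, `raw`) are what the commonly distributed "dataset_coco.json" is assumed to contain, so please verify them against your copy. `captions_by_split` is a made-up helper, and folding "restval" into "train" is a common convention that is exposed as an option rather than assumed.

```python
from collections import defaultdict


def captions_by_split(karpathy, merge_restval=True):
    """Group raw captions by split and image id.

    Assumes each entry in karpathy["images"] has "split", "cocoid",
    and a "sentences" list whose items carry the caption under "raw".
    """
    splits = defaultdict(dict)
    for img in karpathy["images"]:
        split = img["split"]
        if merge_restval and split == "restval":
            split = "train"  # optional: treat restval images as training data
        splits[split][img["cocoid"]] = [s["raw"] for s in img["sentences"]]
    return dict(splits)


# Illustrative stand-in for json.load(open("dataset_coco.json")).
karpathy = {
    "images": [
        {"split": "train", "cocoid": 1, "sentences": [{"raw": "a red bus"}]},
        {"split": "restval", "cocoid": 2, "sentences": [{"raw": "a tall tree"}]},
        {"split": "test", "cocoid": 3, "sentences": [{"raw": "two birds"}]},
    ]
}
by_split = captions_by_split(karpathy)
```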

232525 avatar May 25 '22 05:05 232525

Could you please upload an English version of the file "ICC分词预处理.ipynb"?

Debolena7 avatar Aug 21 '22 21:08 Debolena7

How do I preprocess English image captioning datasets?

Sparkle-Q avatar Apr 03 '24 03:04 Sparkle-Q