PureT
custom datasets
How can I customize my own dataset to train the PureT model?
You need to construct JSON files for your own dataset, modeled on the MSCOCO annotation files, and then generate the files required for training. I have uploaded a new notebook, "ICC分词预处理.ipynb", for reference. It performs the preprocessing (that is, it generates the required files) for an Image Chinese Captioning dataset.
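As a rough illustration of the first step, here is a minimal sketch that builds such a JSON structure from your own captions. The field names (`images`, `sentences`, `tokens`, `raw`, `split`, `imgid`, `sentid`) follow the layout commonly seen in Karpathy-style split files and are an assumption here, not something confirmed by the PureT code. `build_dataset` and `tokenize` are hypothetical helper names, and the tokenizer is a deliberately simple stand-in for English text.

```python
import json
import re


def tokenize(caption):
    # Hypothetical, very simple English tokenizer: lowercase the caption and
    # keep runs of letters, digits, and apostrophes. Not the PureT tokenizer.
    return re.findall(r"[a-z0-9']+", caption.lower())


def build_dataset(captions_by_file, split_by_file, name="custom"):
    """Build a Karpathy-style dict (assumed layout) from custom captions.

    captions_by_file: {image filename: [caption, ...]}
    split_by_file:    {image filename: "train" | "val" | "test"}
    """
    images = []
    sentid = 0
    for imgid, filename in enumerate(sorted(captions_by_file)):
        sentences = []
        for raw in captions_by_file[filename]:
            sentences.append({
                "tokens": tokenize(raw),
                "raw": raw,
                "imgid": imgid,
                "sentid": sentid,
            })
            sentid += 1
        images.append({
            "filename": filename,
            "imgid": imgid,
            "split": split_by_file[filename],
            "sentids": [s["sentid"] for s in sentences],
            "sentences": sentences,
        })
    return {"images": images, "dataset": name}


if __name__ == "__main__":
    demo = build_dataset(
        {"a.jpg": ["A cat sits.", "It's a cat!"], "b.jpg": ["A dog runs."]},
        {"a.jpg": "train", "b.jpg": "val"},
    )
    print(json.dumps(demo, indent=2))
```

The resulting dict can be written out with `json.dump` and then fed to whatever preprocessing produces the remaining training files; that later step is not covered by this sketch.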
Thank you! I will try it soon.
I'm sorry, I still have some questions. In your "ICC分词输出.ipynb", I can't find anything about "coco_train_input.pkl". Do you have any tools for processing COCO Caption (for English, not Chinese)? I mean, how do I get all the files under the "mscoco" folder, such as "txt", "misc", and "sent"?
I'm sorry for my oversight. There is a "dataset_coco.json" file in the "mscoco" directory, and I would like to know how this file is generated.
I haven't started running the following code yet.
The "dataset_coco.json" file is the Karpathy split annotation file for MSCOCO Captioning. It is just a reorganization of the raw MSCOCO JSON annotations. You may want to refer to https://github.com/karpathy/neuraltalk for more details.
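To make "reorganization" concrete, here is a minimal sketch that regroups raw MSCOCO caption annotations (a flat `annotations` list keyed by `image_id`) into one record per image. The output field names are an assumption based on the commonly distributed Karpathy split files. `regroup_coco` and the `split_of` mapping are hypothetical: the real Karpathy split uses a fixed image-to-split assignment that this sketch does not reproduce.

```python
import re
from collections import defaultdict


def regroup_coco(raw, split_of):
    """Regroup raw MSCOCO caption annotations into per-image records.

    raw:      dict with "images" (id, file_name) and "annotations"
              (id, image_id, caption), as in the raw MSCOCO caption files.
    split_of: hypothetical mapping {COCO image id: split name}; images
              missing from it default to "train".
    """
    # Collect all caption annotations belonging to each image.
    anns_by_image = defaultdict(list)
    for ann in raw["annotations"]:
        anns_by_image[ann["image_id"]].append(ann)

    images = []
    for imgid, img in enumerate(raw["images"]):
        sentences = []
        for ann in anns_by_image[img["id"]]:
            caption = ann["caption"].strip()
            sentences.append({
                # Simple stand-in tokenization, not the original pipeline.
                "tokens": re.findall(r"[a-z0-9']+", caption.lower()),
                "raw": caption,
                "imgid": imgid,
                "sentid": ann["id"],
            })
        images.append({
            "filename": img["file_name"],
            "cocoid": img["id"],
            "imgid": imgid,
            "split": split_of.get(img["id"], "train"),
            "sentids": [s["sentid"] for s in sentences],
            "sentences": sentences,
        })
    return {"images": images, "dataset": "coco"}
```

For example, passing the parsed contents of a raw caption annotation file together with your own split mapping returns a dict that can be saved with `json.dump`.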
Could you please upload an English version of the file "ICC分词预处理.ipynb"?
How do I preprocess an English image captioning dataset?