Questions about ptp
Hi, congratulations on the great success of your wonderful work! I have several questions about ptp regarding the pre-training/fine-tuning settings described in the paper. The questions are as follows:
- I noticed that you perform zero-shot retrieval experiments on MS COCO, but in Section 4.1 of the paper I see that COCO is also used in pre-training. Did you exclude COCO from the pre-training dataset before zero-shot testing on COCO?
- You mentioned in the paper that the text prompt is used only in the pre-training stage. That sounds quite fair because it doesn't change the inference setting. As I understand it, though, using ptp changes the distribution of image captions and creates a distribution gap between the training corpus and the testing corpus, which might harm the retrieval results. But the opposite seems to be true: it helps downstream retrieval rather than harming it. Why?
For example, in the zero-shot retrieval setting, captions in the training stage look like "...The block x has a x", but the prompts are no longer used during inference (see the sketch below for what I mean). Why doesn't this harm the performance?
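For concreteness, here is a rough sketch of what I understand the caption augmentation to be; the helper name, grid indexing, and sampling are my own guesses, not your actual code:

```python
import random

def add_ptp_prompt(caption, detected_objects, is_pretraining=True):
    """Hypothetical helper: append a position-guided text prompt to a caption.

    detected_objects: list of (object_name, block_index) pairs, e.g. from an
    off-the-shelf detector mapped onto an N x N grid of image blocks.
    """
    if not is_pretraining or not detected_objects:
        # fine-tuning / inference: the caption distribution is left unchanged
        return caption
    obj, block = random.choice(detected_objects)
    # pre-training: augment the caption with a "The block P has a O" prompt
    return f"{caption} The block {block} has a {obj}."

# pre-training caption: "a dog on the grass. The block 4 has a dog."
print(add_ptp_prompt("a dog on the grass.", [("dog", 4)]))
# zero-shot retrieval caption: prompt dropped, plain caption only
print(add_ptp_prompt("a dog on the grass.", [("dog", 4)], is_pretraining=False))
```

If this matches the intended setup, my question is why the mismatch between the augmented pre-training captions and the plain inference captions does not hurt retrieval.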
Does the scale of the training dataset matter here? I'm also curious whether it would help to use the ptp text prompts in the fine-tuning stage instead of pre-training. I tried to extend ptp to video retrieval and ran some experiments on video datasets, adding ptp in the fine-tuning stage when fine-tuning on MSRVTT, but the performance dropped slightly.
Looking forward to your reply!