T2I-Adapter

Reproduce Issue on COCO 2017 Validation Set

Open AlonzoLeeeooo opened this issue 2 years ago • 16 comments

Hi,

Thanks for the nice work! However, when evaluating the model you provide on Hugging Face, I could not reproduce the reported FID and CLIP scores. The quantitative results of PITI and Stable Diffusion also come out worse than the reported ones. I am currently using the hyperparameter settings you provide in the tutorial of your GitHub repo. Could you please share the hyperparameter settings you used for evaluation on the COCO 2017 validation set? Thanks in advance!

Best regards, Chang

P.S. I am using the implementation at https://github.com/mseitzer/pytorch-fid to calculate the FID score, and the official torchmetrics implementation for the CLIP score.
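
A minimal sketch of how that FID computation can be invoked from Python (the directory names are placeholders for one folder of real COCO val2017 images and one folder of generated samples; the CLI form `python -m pytorch_fid real_dir gen_dir` is equivalent):

```python
from pytorch_fid.fid_score import calculate_fid_given_paths

# Placeholder directories: real COCO val2017 images vs. generated samples.
fid = calculate_fid_given_paths(
    ["coco_val2017_images", "generated_samples"],
    batch_size=50,  # pytorch-fid's default batch size
    device="cuda",
    dims=2048,      # InceptionV3 pool3 features, the standard FID setting
)
print("FID:", fid)
```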

AlonzoLeeeooo · Apr 30 '23 12:04

Hi, we use OpenCLIP to calculate the CLIP score. You can refer to https://github.com/TencentARC/T2I-Adapter/issues/57
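
For concreteness, a minimal sketch of scoring a single image-text pair with the open_clip package is shown below; the model variant and pretrained tag are only illustrative, not necessarily the ones behind the paper's numbers:

```python
import torch
import open_clip
from PIL import Image

# Illustrative model choice; see the follow-up question about ViT-H-14 vs. ViT-B-32.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("generated_sample.png")).unsqueeze(0)  # placeholder path
text = tokenizer(["a placeholder caption from COCO val2017"])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    clip_score = (img_feat * txt_feat).sum(dim=-1)  # cosine similarity, the ~0.26 scale discussed in this thread

print(clip_score.item())
```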

MC-E · Apr 30 '23 15:04

Hi @MC-E ,

Thanks for replying. I will try reproducing these metrics again.

Best,

AlonzoLeeeooo · May 01 '23 03:05

Hi @MC-E ,

By the way, could you please tell me which OpenCLIP model you are using? Is it ViT-H-14, ViT-B-32, or another variant? Thanks so much!

Best,

AlonzoLeeeooo · May 01 '23 03:05

Hi @MC-E ,

Thank you for providing your evaluation code. I have re-run the evaluation on the COCO 2017 validation set, and the resulting CLIP score is slightly better than before, but still worse than your reported numbers: my FID is 21.72 and my CLIP score is 0.2597. Besides, I also find that the CLIP score of SD (v1.4) is better than the one you report, 0.2673 compared to 0.2648. For the hyperparameters, I follow the recommended setup in your GitHub repo instructions. Could you please provide the parameter settings you used for the evaluation on the COCO 2017 validation set? Or is there any parameter that I am not configuring correctly?

Thank you in advance for taking the time to reply. Hope everything goes well with you!

Best,

AlonzoLeeeooo · May 02 '23 04:05

@AlonzoLeeeooo There are 5k images and 30k (text, image) pairs in COCO val2017. How many (image, text) pairs do you test on? 5k or 30k?

ShihaoZhaoZSH · May 09 '23 04:05

Hi @ShihaoZhaoZSH ,

As reported in the paper, the evaluation is conducted on the validation set of 5k images. Note that this 5k-image validation set also comes with official caption annotations.

Best,

AlonzoLeeeooo · May 09 '23 09:05

Thanks for your reply. There are 6 captions for each image in the validation set. So do you run all 6 * 5k = 30k text-image pairs, or just randomly pick one caption for each image and test on 5k text-image pairs?

ShihaoZhaoZSH · May 09 '23 12:05

For each image sample, I randomly pick one caption as the corresponding text prompt. For a public dataset at this scale, that should measure the generation ability of the evaluated model properly.
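
A minimal sketch of this pairing step with pycocotools (the annotation path is a placeholder and the random seed is arbitrary; it yields one (image, caption) pair per validation image, i.e. 5k pairs):

```python
import random
from pycocotools.coco import COCO

# Placeholder path to the official COCO 2017 caption annotations.
coco = COCO("annotations/captions_val2017.json")

random.seed(0)  # fixed seed so the sampled pairs are reproducible
pairs = []
for img_id in coco.getImgIds():
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    caption = random.choice(anns)["caption"]           # one of the several captions per image
    file_name = coco.loadImgs(img_id)[0]["file_name"]
    pairs.append((file_name, caption))

print(len(pairs))  # 5000 pairs, one caption per validation image
```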

AlonzoLeeeooo · May 09 '23 12:05

Got it. I ran into the same problem as you: my reproduced FID is also greater than 20 in both the seg+text and sketch+text settings.

ShihaoZhaoZSH · May 09 '23 12:05

How is your reproduced CLIP score? The reproduced FID could still be reasonable given possible fluctuations across different devices, but my reproduced CLIP score is clearly worse than the reported one: 0.2597 (reproduced) compared to 0.2673 (reported).

AlonzoLeeeooo · May 09 '23 13:05

Sorry, for the main comparison I had used the Anything pretrained weights. Now that I have switched to the Stable Diffusion weights, the FID is lower than 20, but the CLIP score is still below 0.26.

ShihaoZhaoZSH · May 11 '23 14:05

I have a stupid question. When calculating the CLIP score, is it correct to compute the CLIP scores of all COCO 2017 val image-text pairs and then average them? And what negative prompt is required for generation?

YibooZhao · Oct 25 '23 08:10

@AlonzoLeeeooo Hi, could you please add me on WeChat? I would like to ask some questions about training. My email is [email protected].

dmmSJTU · Dec 26 '23 15:12

I have sent my WeChat number to you through e-mail.

AlonzoLeeeooo · Dec 27 '23 13:12

> I have a stupid question. When calculating the CLIP score, is it correct to compute the CLIP scores of all COCO 2017 val image-text pairs and then average them? And what negative prompt is required for generation?

I remember there is an issue that explains this setting (I am not sure whether it uses the first caption of each image). You may want to check it.
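
As a rough sketch of the accumulate-then-average recipe asked about above, using torchmetrics' CLIPScore with dummy inputs (note that torchmetrics reports 100 * cosine similarity, so its values sit on a 0-100 scale rather than the ~0.26 scale used in this thread):

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Dummy images standing in for generated COCO val2017 samples (one per caption).
images = [torch.randint(255, (3, 224, 224)) for _ in range(2)]
captions = ["a person riding a horse", "a bowl of fruit on a table"]

metric.update(images, captions)  # call once per batch; the metric accumulates all pairs
score = metric.compute()         # mean CLIP score over every (image, caption) pair seen
print(score)
```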

AlonzoLeeeooo · Dec 27 '23 13:12