ITI-GEN

Missing Performance Metric

Open Bearwithchris opened this issue 2 years ago • 14 comments

Hi,

Would it be possible to include the performance metrics code used in the original paper?

Bearwithchris avatar Oct 17 '23 05:10 Bearwithchris

Same here; it would be great if you could provide the code for computing the metrics.

ZichenMiao avatar Oct 17 '23 19:10 ZichenMiao

Also, could you provide an equation or a detailed description of how $D_{KL}$ is actually computed?

ZichenMiao avatar Oct 18 '23 18:10 ZichenMiao

Thanks for your interest in our work!

We compute the KL divergence by comparing the predicted attribute distribution with a uniform distribution. We have released the script, including both the use of CLIP for attribute classification and the procedure for computing the KL divergence.
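For illustration, a minimal sketch of that computation (the numpy implementation, variable names, and example counts below are illustrative, not the released script; it assumes the per-category counts have already been obtained from the CLIP attribute classifier):

import numpy as np

def kl_to_uniform(category_counts):
    # KL divergence D_KL(P || U) between the empirical category distribution P
    # (e.g., counts of CLIP-predicted attribute categories) and a uniform distribution U.
    counts = np.asarray(category_counts, dtype=np.float64)
    p = counts / counts.sum()            # empirical distribution over categories
    q = np.full_like(p, 1.0 / len(p))    # uniform reference distribution
    eps = 1e-12                          # guard against log(0) for empty categories
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Example: 104 images predicted "with eyeglasses" vs. 96 "without eyeglasses".
print(kl_to_uniform([104, 96]))  # a small value indicates a near-balanced generation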

czhang0528 avatar Oct 29 '23 06:10 czhang0528

Thanks!

ZichenMiao avatar Nov 02 '23 23:11 ZichenMiao

Hi, sorry for the late follow-up, and thank you for the updates.

Just for clarification, the performance code I was referring to is the "pre-trained classifiers" mentioned in Appendix D. Could these be provided for the results presented in the main manuscript? It is hard to assess ITI-GEN's performance without these scripts/saved models.

Additionally, could the same be done for the FID metric (both the measurement script and the data pre-processing script)? I have tried to replicate the FID measurements but could not get similar results; specifically, my FID score is worse, e.g., Table 2 reports an FID of 60.38 on CelebA, whereas I measure 84.28. I believe this is probably due to a difference in setups.

edit: Could you also provide the FID parameters, e.g., the reference dataset size and how that dataset is obtained?

Thank you in advance for the clarification.

Bearwithchris avatar Nov 16 '23 04:11 Bearwithchris

Thank you for the questions!

  1. Could the classifier weights be provided?

As mentioned in Appendix D, we use CLIP, pre-trained classifiers, and human evaluation together to produce the quantitative results. However, as mentioned in the README, we find that CLIP is superior to the pre-trained classifiers (which might be more biased) at making accurate predictions. We therefore hope the academic community can reproduce the results in a more standard and unified way, so we decided to provide only the evaluation script that uses CLIP.

Here, we can still provide more details to assist you with the evaluation.

For the CelebA attributes, in terms of writing evaluation prompts, please refer to (but do not limit yourself to) Table A3 in the supplementary materials. CLIP provides accurate predictions for most of the attributes. For a few attributes like 'Mustache', where CLIP might be erroneous as mentioned in our paper, combine its results with human evaluation.
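As a rough illustration of this setup, a zero-shot CLIP classification sketch (assuming the open-source clip package; the model name and prompt pair below are only examples and should follow Table A3):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Example prompt pair for a single binary attribute (Eyeglasses).
prompts = ["a headshot of a person with eyeglasses", "a headshot of a person"]
text = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("sample.png")).unsqueeze(0).to(device)
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

pred = probs.argmax(dim=-1).item()  # 0 -> positive (with eyeglasses), 1 -> negative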

For the FairFace age attribute, we cannot provide the evaluation prompts because of the sensitivity of evaluating social attributes.

For the FAIR skin tone attribute, we first leverage a face parsing method to extract the face region from each generated image and then follow the FAIR paper's method to obtain the skin tone type (compute the ITA and then map it to a Fitzpatrick skin type based on thresholds). We cannot provide the code because of the sensitivity of evaluating social attributes.

For LHQ scene attributes, we only show qualitative results.

  2. Issues related to FID.

We have updated the evaluation script to include code for computing the FID score.

Specifically, each score in our paper is computed using over 5K images. Since FID scores are affected by factors such as the number of images, it is possible that you get a higher score if you use fewer images. As a sanity check, we suggest directly comparing against the FID score of images from the baseline Stable Diffusion in the same setup.
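For that sanity check, something along these lines could work (a minimal sketch assuming the clean-fid package and illustrative folder paths, not the released evaluation script):

from cleanfid import fid

# Compare ITI-GEN and baseline Stable Diffusion against the same FFHQ reference,
# using the same (>5K) number of images for both, so the two scores are comparable.
ffhq_dir = "data/ffhq_images"  # illustrative path to the FFHQ reference images
itigen_fid = fid.compute_fid("outputs/itigen_samples", ffhq_dir)
baseline_fid = fid.compute_fid("outputs/sd_baseline_samples", ffhq_dir)
print(f"ITI-GEN FID: {itigen_fid:.2f} | SD baseline FID: {baseline_fid:.2f}")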

Xuanbai-Chen avatar Nov 19 '23 00:11 Xuanbai-Chen

I am also seeing an FID problem. Below are my generation and evaluation commands. With seed 42 and 5,000 generated images, the FID is 143.66. Can you provide the verification steps?

export CUDA_VISIBLE_DEVICES=5
python generation.py \
    --config='models/sd/configs/stable-diffusion/v1-inference.yaml' \
    --ckpt='sd_realistic/Realistic_Vision_V4.0.ckpt' \
    --plms \
    --skip_grid \
    --attr-list='Eyeglasses' \
    --outdir='2024_3_21_original_iti-gen_outputs/a_headshot_of_a_person_Eyeglasses/sample_results_0' \
    --prompt-path='ckpts/a_headshot_of_a_person_Eyeglasses/original_prompt_embedding/basis_final_embed_19.pt' \
    --n_iter=1250 \
    --n_samples=4
python evaluation.py \
    --img-folder 2024_3_21_original_iti-gen_outputs/a_headshot_of_a_person_Eyeglasses/sample_results_0/Eyeglasses_positive \
    --device 7 \
    --class-list 'a headshot of a person , eyeglasses' 'a headshot of a person' 

The SD base model is Realistic Vision 4.0. Results:

[generated sample images attached]

euminds avatar Mar 21 '24 05:03 euminds

Thanks for your interest in our work!

  • the FID is 143.66.

This is possible because you evaluate using only images that all contain eyeglasses (based on the command you provided, "Eyeglasses_positive"). Since we compute FID against the FFHQ dataset, obtaining a relatively low score requires the distribution of the generated images to be as similar as possible to FFHQ. However, in the real domain (FFHQ), there are not many face images with eyeglasses. Therefore, using only images with eyeglasses not only increases the domain gap but also decreases the diversity of the generated images.

  • Can you provide verification steps?

On page 7 of our main text, in the paragraph Single Binary Attribute, you can find that we evaluate 5 text prompts -- “a headshot of a {person, professor, doctor, worker, firefighter}” -- and sample 200 images per prompt for each attribute, resulting in 40K generated images. To compute the FID score (the CelebA one in Table 2, 60.38), we use the subset of these 40K images generated with "a headshot of a person", i.e., 200 images × 40 attributes = 8K images in total. Covering different attributes improves the diversity of the generated images and reduces the gap with FFHQ.

  • the provided results.

Are they from the original SD or from ITI-GEN? They look good to me. If they are not from ITI-GEN, could you attach some ITI-GEN results?

Xuanbai-Chen avatar Mar 21 '24 07:03 Xuanbai-Chen

Thank you for your response. My results are from a fine-tuned version of the SD 1.5 model, Realistic Vision V4.0. ITI-GEN exhibits a certain degree of generalization to these models.

euminds avatar Mar 22 '24 00:03 euminds

To compute the FID score (the CelebA one in Table 2, 60.38), we use the subset of these 40K images generated with "a headshot of a person", i.e., 200 images × 40 attributes = 8K images in total. Covering different attributes improves the diversity of the generated images and reduces the gap with FFHQ.

Do the 8K images mean 20 images for each attribute?

euminds avatar Mar 22 '24 00:03 euminds

200 images for each attribute (e.g., Eyeglasses), with 100 images containing the attribute (wearing eyeglasses) and the other 100 not (without eyeglasses).

Xuanbai-Chen avatar Mar 22 '24 00:03 Xuanbai-Chen

To compute the FID score (the CelebA one in Table 2, 60.38), we use the subset of these 40K images generated with "a headshot of a person", i.e., 200 images × 40 attributes = 8K images in total.

What is the distribution of the 8K images mentioned in the text?

euminds avatar Mar 22 '24 00:03 euminds

The detailed steps are as follows:

  1. Train ITI-GEN with the prompt "a headshot of a person" on every single binary attribute of CelebA, obtaining 40 learned soft tokens.
  2. Use every soft token to sample 200 images (100 images per category).
  3. Use the 200*40 = 8,000 images to compute FID against FFHQ (see the sketch below).
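A sketch of step 3 under those assumptions (the folder layout, file extension, and the clean-fid package are illustrative, not the exact released script):

import glob
import shutil
from pathlib import Path
from cleanfid import fid

# Pool the 200 images sampled for each of the 40 CelebA attributes (100 per category)
# into a single folder, then compute one FID score against FFHQ.
pool = Path("outputs/fid_pool")
pool.mkdir(parents=True, exist_ok=True)

images = glob.glob("outputs/a_headshot_of_a_person_*/**/*.png", recursive=True)
assert len(images) == 200 * 40, f"expected 8,000 images, found {len(images)}"
for i, src in enumerate(images):
    shutil.copy(src, pool / f"{i:05d}.png")

score = fid.compute_fid(str(pool), "data/ffhq_images")  # FFHQ path is illustrative
print(f"FID over the pooled 8K images: {score:.2f}")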

Xuanbai-Chen avatar Mar 22 '24 05:03 Xuanbai-Chen

Obtaining a relatively low score requires the distribution of the generated images to be as similar as possible to FFHQ. However, in the real domain (FFHQ), there are not many face images with eyeglasses. Therefore, using only images with eyeglasses not only increases the domain gap but also decreases the diversity of the generated images.

Hi Author,

Thank you for all the help. I've noticed the latest discussion, which made me a little confused. As per Fig. 14 of your appendix, an FID in [55.6, 67.4] should in fact be achievable with just Male x Eyeglasses. Correct me if I've misunderstood this.

I appreciate the discussion.

Bearwithchris avatar Apr 05 '24 09:04 Bearwithchris

The different FIDs in Fig. 14 are obtained by using different token lengths. However, they are not computed on Male x Eyeglasses alone; they are computed using the same process described in the earlier comments.

Xuanbai-Chen avatar Jun 25 '24 13:06 Xuanbai-Chen