
Evaluation script

Open LuigiSigillo opened this issue 7 months ago • 2 comments

Hi, thanks for your amazing work! I really appreciate the availability of the code and the checkpoints. In Table 1 of your paper you report quantitative results for the baseline methods and URAE, with prompts from the HPD and DPG datasets. I am trying to replicate your results, but I do not understand the setup: do you generate images with the HPDv2 test prompts and then compute FID and LPIPS against the test images available in the same dataset? Or what did you use as reference images? Would it be possible for you to release the evaluation script? Thank you!

LuigiSigillo avatar May 05 '25 09:05 LuigiSigillo

Hi @LuigiSigillo, thanks for your attention to our work! The prompts for the HPD benchmark are from https://huggingface.co/datasets/zhwang/HPDv2/tree/main/benchmark (excluding DrawBench). You can get the prompts with:

import hpsv2

# Benchmark prompts for all styles in the HPSv2 benchmark
all_prompts = hpsv2.benchmark_prompts('all')
for style, prompts in all_prompts.items():
    for idx, prompt in enumerate(prompts):
        # TextToImageModel is a placeholder for your own generation call
        image = TextToImageModel(prompt)

as provided by the HPSv2 authors. We generate all the images following the HPSv2 instructions for a fair comparison.

We calculate FID and LPIPS against reference images generated by the FLUX1.1 [Pro] Ultra model with the same prompts. We will make these settings clearer in the paper. Thanks for pointing this out.

Thanks for your question; we will release the evaluation script soon.
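Until the official script is out, here is a minimal sketch of how the generated and reference image sets could be paired for metric computation. It assumes both directories hold one image per prompt under matching filenames; the actual LPIPS and FID calls (e.g. via the `lpips` and `pytorch-fid` packages) are only indicated in comments, since the exact evaluation settings are the authors' to confirm:

```python
from pathlib import Path

def paired_images(gen_dir: str, ref_dir: str):
    """Yield (generated, reference) path pairs matched by filename.

    Assumes both directories contain images saved under the same
    names, e.g. one PNG per HPSv2 prompt index.
    """
    for gen_path in sorted(Path(gen_dir).glob("*.png")):
        ref_path = Path(ref_dir) / gen_path.name
        if ref_path.exists():
            yield gen_path, ref_path

# Each pair would then be scored with LPIPS, e.g.:
#   loss_fn = lpips.LPIPS(net='alex'); d = loss_fn(img_a, img_b)
# and FID computed over the two whole folders, e.g.:
#   python -m pytorch_fid <gen_dir> <ref_dir>
```

Only the pairing logic is shown; whether the paper averages LPIPS per style or over all 3,200 prompts is an assumption best left to the released script.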

Lexie-YU avatar May 06 '25 09:05 Lexie-YU

Thank you for your fast answer! Now I understand: FLUX1.1 [Pro] Ultra serves to produce the reference images that HPSv2 does not provide. To reproduce the paper's evaluation results, at $0.06 per image generated by FLUX Pro Ultra, with 4 categories of 800 images each, that comes to almost $200 🤑
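The estimate above can be checked with a quick back-of-the-envelope script (the per-image price and prompt counts are the figures quoted in this thread, not official numbers):

```python
# Cost estimate for generating the FLUX1.1 [Pro] Ultra reference set.
PRICE_PER_IMAGE = 0.06    # USD per image, as quoted in this thread
CATEGORIES = 4            # HPSv2 benchmark styles
PROMPTS_PER_CATEGORY = 800

total = PRICE_PER_IMAGE * CATEGORIES * PROMPTS_PER_CATEGORY
print(f"${total:.2f}")    # → $192.00
```

So "almost $200" is $192 for a single full reference set, before any repeated runs or discarded generations.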

LuigiSigillo avatar May 06 '25 09:05 LuigiSigillo