MM-Diffusion
Reproducing results in the paper
Thanks for the great work! However, directly running the following commands does not produce the same FVD/KVD/FAD scores as in the paper. May I ask for the data and configurations needed to reproduce the paper's results, i.e., FVD=117.20, KVD=5.78, FAD=10.72?
bash ssh_scripts/multimodal_sample_sr.sh
bash ssh_scripts/multimodal_eval.sh
When running the above commands, the output I got is:
evaluate for 2048 samples
metric:{'fvd': 338.2535400390625, 'kvd': 10.005603799600976, 'fad': 1.3610674068331718}
I am also facing similar issues for both the SR (256 x 256) and the 64 x 64 resolution videos, on both the AIST++ and Landscape datasets.
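For context on what these numbers measure: FVD and FAD are Fréchet distances between Gaussian fits of real and generated feature embeddings (I3D features for video, AudioCLIP features for audio in this repo), while KVD uses a polynomial-kernel MMD. A generic, minimal Fréchet-distance sketch (not the repo's evaluator) looks like this:

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    # Fit a Gaussian to each set of embeddings, shape (num_samples, dim).
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)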
@ltzheng In the evaluation code, the FAD is being multiplied by 1e3 instead of the 1e4 mentioned in the paper. Correcting the scaling factor brings your FAD to 13.61, which is closer to the paper's 10.72 but still higher.
Waiting for the authors' response to this thread.
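For anyone comparing numbers in the meantime, the correction is just a change of scale; a minimal sketch (the function name here is mine, not the repo's):

def rescale_fad(reported_fad: float,
                used_scale: float = 1e3,
                paper_scale: float = 1e4) -> float:
    """Convert an FAD reported at one scale factor to the paper's scale."""
    raw = reported_fad / used_scale   # undo the evaluator's factor
    return raw * paper_scale          # apply the paper's factor

# 1.361 reported with the 1e3 factor corresponds to 13.61 at 1e4,
# closer to (but still above) the paper's 10.72.
print(rescale_fad(1.3610674068331718))  # -> 13.610674068331718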
Hi, I have a quick question about the FAD metric. On line 17 of https://github.com/researchmm/MM-Diffusion/blame/1d2d5ad9b47f57e7d300e087af8eb93181da094d/mm_diffusion/evaluator.py#L17, the audio rate is 44100, but the original sampling rate is 16k. Do you know why the script uses 44100 rather than 16k? Many thanks.
Hey @kaiw7 , I think the audio rate during evaluation is set to 44.1kHz because the AudioCLIP model is trained on 44.1kHz data.
In my opinion, a fairer way to handle the evaluation would have been to remove the frequencies above 8kHz from the real data (since 16kHz generated audio cannot contain anything above its 8kHz Nyquist limit), but I guess this is also fine, as the metrics then also account for the model's inability to generate the high-frequency components of the audio.
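A hedged sketch of both points using torchaudio (the function and parameter choices are illustrative, not taken from the repo): resample to the 44.1kHz rate AudioCLIP expects, and optionally low-pass the real audio at 8kHz, the Nyquist limit of the 16kHz generated clips:

import torchaudio.functional as F

def prepare_for_audioclip(wav, orig_sr, target_sr=44_100,
                          match_generated_band=False):
    # Resample to the 44.1 kHz rate the AudioCLIP embedder was trained on.
    wav = F.resample(wav, orig_freq=orig_sr, new_freq=target_sr)
    if match_generated_band:
        # Remove content above 8 kHz so real audio only keeps the band
        # that a clip generated at 16 kHz could possibly contain.
        wav = F.lowpass_biquad(wav, sample_rate=target_sr,
                               cutoff_freq=8_000)
    return wav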
@mayank-git-hub Thank you very much for your kind reply. Have you been able to reproduce the results of MM-Diffusion, in terms of either the objective metrics or the subjective quality?
The results in the paper (FVD=117.20, KVD=5.78, FAD=10.72) are produced by DDPM sampling (1000 steps) with a video size of 64x64. The default sampling method in multimodal_sample_sr.sh is the DPM solver, which gives FVD=229.08, KVD=3.26, FAD=9.39.
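To illustrate the difference between the two samplers (this is not MM-Diffusion's actual entry point; the sketch uses Hugging Face diffusers schedulers and a dummy model purely to contrast full DDPM sampling with few-step DPM-Solver sampling):

import torch
from diffusers import DDPMScheduler, DPMSolverMultistepScheduler

def sample(scheduler, model, shape, num_steps, device="cpu"):
    scheduler.set_timesteps(num_steps, device=device)
    x = torch.randn(shape, device=device)
    for t in scheduler.timesteps:
        with torch.no_grad():
            eps = model(x, t)                     # predicted noise
        x = scheduler.step(eps, t, x).prev_sample  # one denoising step
    return x

# Toy stand-in; the real model is the multimodal video/audio UNet.
model = lambda x, t: torch.zeros_like(x)
shape = (1, 3, 16, 64, 64)  # (batch, channels, frames, H, W)

# Paper numbers come from full 1000-step ancestral DDPM sampling.
ddpm = DDPMScheduler(num_train_timesteps=1000)
x_ddpm = sample(ddpm, model, shape, num_steps=1000)

# The default script uses DPM-Solver with far fewer steps.
dpm = DPMSolverMultistepScheduler(num_train_timesteps=1000)
x_dpm = sample(dpm, model, shape, num_steps=50)

DPM-Solver replaces the thousand DDPM steps with a few dozen, which explains both the large speedup and the gap between the two sets of metrics.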
I ran the code on the AIST++ dataset using the DPM solver with 2048 samples. The results I obtained are: {'fvd': 150.32235717773438, 'kvd': 26.141661956593282, 'fad': 12.613938888534904}. These results are comparable to those reported in the paper: FVD=176.55, KVD=31.92, FAD=12.90.
@mayank-git-hub “The FAD was being multiplied by 1e3 instead of 1e4 as mentioned in the paper.” This bug has been fixed. Thanks for the reminder.