RTMV test set scene and view idxs
Hi,
I am trying to replicate the RTMV table of the main paper, but I can only get the following numbers:
| RTMV | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Zero123 (paper) | 10.41 | 0.606 | 0.323 |
| Zero123 (rerun, provided 105000.ckpt) | 10.09 | 0.540 | 0.406 |
Zero123 (rerun provided 105000.ckpt) | 10.09 | .540 | .406 |
Since I don't know the exact subset of Google RTMV scenes you used, I don't know whether the discrepancy is due to my incorrect implementation of the eval pipeline or a difference in the difficulty of the subset I sampled. Could you please provide the test scene idxs and view idxs? Thank you!
Hi @kylesargent , I can double-check the code later today but I remember we used the first 20 scenes in the Google Scanned Objects split in RTMV for evaluation. We used the first frame as input, and the subsequent 16 frames for evaluation. Could you try these settings and see if you can roughly reproduce the numbers? If not, I will double-check our evaluation code.
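Concretely, the split described above amounts to something like the following rough Python sketch (the scene-name listing and function name are placeholders, not our actual evaluation code):

```python
def rtmv_eval_split(scene_names, n_scenes=20, n_target_views=16):
    """Build the eval split: first `n_scenes` scenes (sorted), frame 0 as the
    conditioning view, and the next `n_target_views` frames as targets.

    Returns a list of (scene, input_view_idx, target_view_idxs) tuples.
    """
    split = []
    for scene in sorted(scene_names)[:n_scenes]:
        split.append((scene, 0, list(range(1, 1 + n_target_views))))
    return split
```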
Hi, did you manage to replicate the results? @kylesargent
After using the specified view idxs, the performance actually drops slightly:
| RTMV | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Zero123 (paper) | 10.41 | 0.606 | 0.323 |
| Zero123 (rerun, provided 105000.ckpt) | 10.09 | 0.540 | 0.406 |
| Zero123 (rerun, provided 105000.ckpt; first 20 scenes, first 1+16 views per scene) | 10.07 | 0.536 | 0.422 |
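For reference, the PSNR I report is the standard definition over images in [0, 1]; here is a minimal numpy sketch of it (my own helper, not code from the repo):

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val**2 / mse))
```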
I have swept over various hyperparameters, such as the number of DDIM steps and the guidance scale, to obtain the best performance.
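The sweep itself is a plain grid search; a minimal sketch (the grids and `eval_fn` are placeholders for whatever sampler/metric you plug in):

```python
from itertools import product

def sweep(eval_fn, ddim_steps_grid=(50, 100, 200), scale_grid=(1.0, 2.0, 3.0)):
    """Grid-search DDIM steps and guidance scale, keeping the best score.

    `eval_fn(ddim_steps=..., guidance_scale=...)` should return a scalar to
    maximize (e.g. mean PSNR over the eval split).
    Returns (best_score, best_ddim_steps, best_guidance_scale).
    """
    best = None
    for steps, scale in product(ddim_steps_grid, scale_grid):
        score = eval_fn(ddim_steps=steps, guidance_scale=scale)
        if best is None or score > best[0]:
            best = (score, steps, scale)
    return best
```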
Are there any more details for the eval that I may have missed? For instance, what are your own settings for DDIM steps and guidance scale? Is eval run on the ema model or the regular model?
Additionally, I was hoping you could comment on how long it takes to reach reasonable PSNR/SSIM/LPIPS. I am retraining the model myself, but so far at 20K steps the PSNR and visual quality of the novel views are somewhat poor (PSNR ~8.2). I am continuing to train as we speak, but I was curious how long reasonable performance typically takes.
Thank you!
Hi @kylesargent , could you please share the evaluation code with me? I can take a look and see what might be the problem that's causing the discrepancy.
Assuming you are using the exact same configuration as stated in the paper (same batch size, learning rate, dataset, etc.), our results should be reproducible at around 100K iterations. At around 50K, the model should produce reasonable novel-view images. Could you share some examples of inference with your current checkpoint? Note that RTMV is an extremely challenging dataset, typically used under multiview settings (50-150 views).
Thanks very much for the response.
Regarding the eval code, my implementation is at https://github.com/kylesargent/zero123/blob/e72542d07c13e0aca2809cb26c01c63889e920ea/zero123/ldm/models/diffusion/ddpm.py#L629. I would really appreciate it if you could take a look. The default eval parameters which I used are here: https://github.com/kylesargent/zero123/blob/e72542d07c13e0aca2809cb26c01c63889e920ea/zero123/configs/sd-objaverse-finetune-c_concat-256.yaml#L45
Just let me know if there are any discrepancies between this and your scripts and I can check the results with any necessary modifications.
Regarding the retrain: now that my model has reached 42.5K iterations, the PSNR, SSIM, and LPIPS are quite close to rows 2 and 3 of the table above, so I believe running for the full 105K steps will reach those numbers. My main question now is how to fully replicate the performance given in the main tables with the existing 105000.ckpt.
Here are some visual results from inference using the pretrained checkpoint. They look pretty reasonable to me:
Input: (image)
GT: (image)
Pred: (image)