
RTMV test set scene and view indices

Open kylesargent opened this issue 1 year ago • 5 comments

Hi,

I am trying to replicate the RTMV table of the main paper but I can only get

| RTMV | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Zero123 (paper) | 10.41 | .606 | .323 |
| Zero123 (rerun, provided 105000.ckpt) | 10.09 | .540 | .406 |
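
For reference, here is a minimal sketch of how I compute the three metrics, in case the discrepancy lies there. The LPIPS backbone choice (`vgg` vs. `alex`) is an assumption on my end and may differ from what the paper used:

```python
import lpips
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Backbone is an assumption; the paper may use 'alex' instead of 'vgg'.
lpips_fn = lpips.LPIPS(net='vgg')

def compute_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: HxWx3 uint8 images. Uses the skimage >= 0.19 API."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```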

Since I don't know the exact subset of Google RTMV scenes you used, I can't tell whether the discrepancy comes from an error in my eval pipeline or from a difference in the difficulty of the subset I sampled. Could you please provide the test scene indices and view indices? Thank you!

kylesargent avatar Jul 05 '23 21:07 kylesargent

Hi @kylesargent , I can double-check the code later today but I remember we used the first 20 scenes in the Google Scanned Objects split in RTMV for evaluation. We used the first frame as input, and the subsequent 16 frames for evaluation. Could you try these settings and see if you can roughly reproduce the numbers? If not, I will double-check our evaluation code.
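
As a rough sketch, the split looks like the following; the directory layout, root path, and file extension here are assumptions about a local copy of RTMV, not part of our released code:

```python
from pathlib import Path

# Layout assumed: one folder per Google Scanned Objects scene, with frames
# sorted lexicographically inside each scene. The root path is hypothetical.
RTMV_GSO_ROOT = Path("/data/rtmv/google_scanned")

scenes = sorted(p for p in RTMV_GSO_ROOT.iterdir() if p.is_dir())[:20]
for scene in scenes:
    frames = sorted(scene.glob("*.exr"))  # RTMV ships HDR frames as EXR
    input_view = frames[0]       # first frame is the conditioning input
    target_views = frames[1:17]  # subsequent 16 frames are evaluated
```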

ruoshiliu avatar Jul 06 '23 15:07 ruoshiliu

Hi, did you manage to replicate the results? @kylesargent

VitorGuizilini-TRI avatar Jul 11 '23 23:07 VitorGuizilini-TRI

After using the specified view indices, the performance actually drops slightly:

| RTMV | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Zero123 (paper) | 10.41 | .606 | .323 |
| Zero123 (rerun, provided 105000.ckpt) | 10.09 | .540 | .406 |
| Zero123 (rerun, provided 105000.ckpt; first 20 scenes, first 1+16 views per scene) | 10.07 | .536 | .422 |

I have swept over hyperparameters such as the number of DDIM steps and the guidance scale to obtain the best performance.
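
Concretely, the sweep looked roughly like this; `run_rtmv_eval` is a hypothetical stand-in for my evaluation loop, and the grid values below are examples rather than the exact ones I tried:

```python
import itertools

# run_rtmv_eval is a hypothetical wrapper around my eval pipeline that
# returns (psnr, ssim, lpips) averaged over the 20 scenes x 16 views.
best = None
for steps, scale in itertools.product([25, 50, 100, 200], [1.0, 2.0, 3.0, 5.0]):
    psnr, ssim, lp = run_rtmv_eval(ddim_steps=steps, guidance_scale=scale)
    if best is None or psnr > best[0]:
        best = (psnr, ssim, lp, steps, scale)
print("best (psnr, ssim, lpips, steps, scale):", best)
```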

Are there any other details of the eval that I may have missed? For instance, what are your own settings for DDIM steps and guidance scale? Is the eval run on the EMA model or the regular model?

Additionally, I was hoping you could comment on how long it takes to reach reasonable PSNR/SSIM/LPIPS. I am retraining the model myself, but so far, at 20K steps, the PSNR and visual quality of the novel views are somewhat poor (PSNR ≈ 8.2). I am continuing to train the model as we speak, but I was curious how long reasonable performance typically takes.

Thank you!

kylesargent avatar Jul 12 '23 02:07 kylesargent

Hi @kylesargent , could you please share the evaluation code with me? I can take a look and see what might be causing the discrepancy.

Assuming you are using the exact same configuration as stated in the paper (same batch size, learning rate, dataset, etc.), our results should be reproducible at around 100K iterations. At around 50K, the model should already produce reasonable novel-view images. Could you share some examples of inference with your current checkpoint? Note that RTMV is an extremely challenging dataset, typically used in multiview settings (50–150 views).

ruoshiliu avatar Jul 16 '23 13:07 ruoshiliu

Thanks very much for the response.

Regarding the eval code, my implementation is at https://github.com/kylesargent/zero123/blob/e72542d07c13e0aca2809cb26c01c63889e920ea/zero123/ldm/models/diffusion/ddpm.py#L629. I would really appreciate it if you could take a look. The default eval parameters I used are here: https://github.com/kylesargent/zero123/blob/e72542d07c13e0aca2809cb26c01c63889e920ea/zero123/configs/sd-objaverse-finetune-c_concat-256.yaml#L45
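
For context, those config values ultimately drive a standard latent-diffusion DDIM sampling call along these lines. This is only a sketch: the conditioning construction is elided, the dict keys are assumed from the c_concat config name, and the default steps/scale values are placeholders rather than the paper's settings:

```python
from ldm.models.diffusion.ddim import DDIMSampler

def sample_novel_view(model, cond, uc, ddim_steps=75, scale=3.0, batch_size=1):
    """Classifier-free-guided DDIM sampling with a zero123 LatentDiffusion model.

    cond / uc are conditioning dicts, assumed to look like
    {"c_concat": [...], "c_crossattn": [...]}; their construction is elided.
    """
    sampler = DDIMSampler(model)
    samples, _ = sampler.sample(
        S=ddim_steps,                        # number of DDIM steps under sweep
        conditioning=cond,
        batch_size=batch_size,
        shape=[4, 32, 32],                   # latent shape for 256x256, f=8 autoencoder
        unconditional_guidance_scale=scale,  # classifier-free guidance weight
        unconditional_conditioning=uc,
        eta=0.0,                             # deterministic DDIM
    )
    return model.decode_first_stage(samples)  # decode latents to image space
```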

Just let me know if there are any discrepancies between this and your scripts and I can check the results with any necessary modifications.

Regarding the retrain: now that my model has reached 42.5K iterations, the PSNR, SSIM, and LPIPS are quite close to rows 2 and 3 of the table above, so I believe that training for the full 105K steps will recover those numbers. My main question now is how to fully replicate the performance reported in the main tables with the existing 105000.ckpt.

Here are some visual results from inference using the pretrained checkpoint. They look pretty reasonable to me:

Input: [input image]

GT: [ground-truth image]

Pred: [predicted image]

kylesargent avatar Jul 19 '23 05:07 kylesargent