
NeRF 9.5h results

Open · kwea123 opened this issue 3 years ago · 11 comments

Which implementation did you use to generate these results? They seem much worse than both the results reported in the paper and the results I got with my own implementation.

[Screenshot from 2021-07-07 00-18-45] [Screenshot from 2021-07-07 00-19-23]

Here's my result on horns: PSNR already reaches 31.6 after 9.5 h (and 25.91 after 22 min, by the way). I believe there must be some simplification in the implementation you adopted, because I implement it almost exactly the way the original paper does and use the same hyperparameters.

kwea123 avatar Jul 06 '21 15:07 kwea123

Hi kwea123, thanks for your attention. Previously, I evaluated NeRF with your implementation. I think there are two main differences here: 1) we trained the scenes with 16 training views instead of all images, 2) in the LLFF/NeRF scenes, we didn't use NDC rays, so as to align their near-far range with the DTU dataset.

apchenstu avatar Jul 06 '21 16:07 apchenstu

Recently I found that NDC ray sampling influences the quality on LLFF a lot, so I am re-evaluating those scenes, and also IBRNet with their official code. I will update the results in our revision.

apchenstu avatar Jul 06 '21 16:07 apchenstu

  1. we trained the scenes with 16 training views instead of all images

I see, maybe that makes a difference too; here I use all but one image, so 61 images.

kwea123 avatar Jul 07 '21 00:07 kwea123

According to the paper (just after Equation 4): "In this work, we parameterize (u, v, z) using the normalized device coordinate (NDC) at the reference view." Many other statements also mention that you use NDC (for your method).

So I'm confused about why you didn't use NDC for the LLFF/NeRF scenes?

kwea123 avatar Jul 07 '21 12:07 kwea123

  2. in the LLFF/NeRF scenes, we didn't use NDC rays, so as to align their near-far range with the DTU dataset

Oh, "we didn't use NDC rays" means we didn't pre-process the dataset, i.e., we use their real near-far ranges (NeRF: 2-6, LLFF: ~2-12, DTU: 2.125-5). This process refers to this setting.
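(For readers comparing the two setups: a minimal sketch of what sampling with the real metric near-far bounds looks like, assuming a standard NeRF-style ray sampler; the function and variable names below are illustrative, not the repo's.)

```python
# Minimal sketch: sampling depths with real metric near-far bounds
# (instead of converting rays to NDC). Names are illustrative only.
import torch

def sample_along_rays(rays_o, rays_d, near, far, n_samples=128):
    """rays_o, rays_d: (N, 3) ray origins and directions.
    near, far: scalar metric bounds, e.g. 2-6 for the NeRF synthetic scenes,
    ~2-12 for LLFF, 2.125-5 for DTU (the ranges quoted above)."""
    t = torch.linspace(0.0, 1.0, n_samples, device=rays_o.device)          # (S,)
    z_vals = near * (1.0 - t) + far * t                                     # linear in metric depth
    pts = rays_o[:, None, :] + rays_d[:, None, :] * z_vals[None, :, None]  # (N, S, 3)
    return pts, z_vals
```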

And yes, we are using the NDC position for the encoding volume (reference view), e.g., the xyz_NDC in this line.
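(A rough sketch of that step, under the assumption that reference-view pixel coordinates and depth are normalized into [-1, 1] so the encoding volume can be queried with F.grid_sample; the names below are mine, not the repo's.)

```python
# Sketch: convert world-space sample points to reference-view NDC and
# query the encoding volume. Conventions here are assumptions.
import torch
import torch.nn.functional as F

def to_reference_ndc(pts_world, w2c_ref, K_ref, near, far, H, W):
    """pts_world: (N, 3) world-space sample points along the rays."""
    pts_cam = (w2c_ref[:3, :3] @ pts_world.T + w2c_ref[:3, 3:4]).T   # (N, 3) in the reference camera frame
    uvz = (K_ref @ pts_cam.T).T                                      # project with intrinsics
    uv = uvz[:, :2] / uvz[:, 2:3]                                    # pixel coordinates
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0                               # normalize to [-1, 1]
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    z = 2.0 * (pts_cam[:, 2] - near) / (far - near) - 1.0            # depth normalized by near-far
    return torch.stack([u, v, z], dim=-1)                            # (N, 3) "xyz_NDC"-style coordinates

def query_encoding_volume(volume, xyz_ndc):
    """volume: (1, C, D, H, W); xyz_ndc: (N, 3) in [-1, 1] as (u, v, z)."""
    grid = xyz_ndc.view(1, 1, 1, -1, 3)                              # grid_sample expects (x, y, z) order
    feats = F.grid_sample(volume, grid, align_corners=True)          # (1, C, 1, 1, N)
    return feats.view(volume.shape[1], -1).T                         # (N, C) per-point features
```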

apchenstu avatar Jul 07 '21 13:07 apchenstu

May I also ask a question about the implementation here? I see that you're using 3 neighbouring views in the MVSNet to create the feature volume. However, the original MVSNet used a reference image (the one we need to render, which is unavailable at inference time). According to your code, you project all images onto the image plane of the view with index 0, which I believe is a source view. This contradicts your paper; could you please clarify this point? This is the code part I'm writing about: [screenshot]

oOXpycTOo avatar Jul 09 '21 10:07 oOXpycTOo

Hi oOXpycTOo, we use three neighboring views (i.e., source views) to create the feature volume, meaning the input is three images with a relatively small baseline, and we project the other two images' features onto the reference view (the index-0 view). I hope this answers your question, thanks.
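(For anyone following along, here is a simplified plane-sweep warp in the spirit of MVSNet, projecting a source view's features onto the reference view's fronto-parallel depth planes. It is a sketch under standard conventions, not the repo's exact code; names and matrix conventions are assumptions.)

```python
# Sketch: warp a source-view feature map onto the reference view's depth planes.
import torch
import torch.nn.functional as F

def homo_warp_to_ref(src_feat, proj_ref, proj_src, depth_values):
    """src_feat:     (B, C, H, W)  source-view features
    proj_ref/src:    (B, 4, 4)     world-to-pixel projection matrices (K @ [R|t])
    depth_values:    (B, D)        depth hypotheses in the reference frame
    returns          (B, C, D, H, W) warped feature volume"""
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]

    # pixel grid of the reference view in homogeneous coordinates: (B, 3, H*W)
    y, x = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    xyz = torch.stack([x, y, torch.ones_like(x)], dim=0).float().view(3, -1)
    xyz = xyz.unsqueeze(0).expand(B, -1, -1)

    # relative transform: reference pixel+depth -> source pixel
    proj = proj_src @ torch.inverse(proj_ref)                          # (B, 4, 4)
    R, t = proj[:, :3, :3], proj[:, :3, 3:4]

    # back-project every reference pixel at every depth hypothesis
    xyz_d = (R @ xyz).unsqueeze(2) * depth_values.view(B, 1, D, 1)     # (B, 3, D, H*W)
    xyz_d = xyz_d + t.unsqueeze(2)
    uv = xyz_d[:, :2] / xyz_d[:, 2:3].clamp(min=1e-6)                  # (B, 2, D, H*W)

    # bilinearly sample the source features at the projected locations
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, align_corners=True)         # (B, C, D*H, W)
    return warped.view(B, C, D, H, W)
```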

apchenstu avatar Jul 09 '21 12:07 apchenstu

Yes, but that's quite different from the original MVSNet idea, where they projected these features onto the image plane of the view to be predicted. What is the physical meaning of projecting the other two features onto one of the source views?

oOXpycTOo avatar Jul 09 '21 13:07 oOXpycTOo

Oh, I got your point. Yes, it's different: we think building the volume at the target view may provide better depth quality, but it is not an efficient way to do free-viewpoint rendering. The projection is homography plane warping.
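(For reference, this is the standard plane-induced homography behind that warping, mapping reference-view pixels to a source view for the fronto-parallel plane at depth d, with relative pose R, t from the reference to the source camera and plane normal n = (0, 0, 1)^T; it is the textbook formula, not something specific to this repo.)

$$
H(d) \;=\; K_{\mathrm{src}} \left( R + \frac{t\, n^{\top}}{d} \right) K_{\mathrm{ref}}^{-1},
\qquad
\tilde{u}_{\mathrm{src}} \;\sim\; H(d)\, \tilde{u}_{\mathrm{ref}}
$$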

apchenstu avatar Jul 10 '21 08:07 apchenstu

  Hi kwea123, thanks for your attention. Previously, I evaluated NeRF with your implementation. I think there are two main differences here: 1) we trained the scenes with 16 training views instead of all images, 2) in the LLFF/NeRF scenes, we didn't use NDC rays, so as to align their near-far range with the DTU dataset.

I would like to ask what you meant by "trained the scenes with 16 training views" here. In the DTU training data you provide, it seems that each scan is trained with 49 pairs of views (1 target and 3 source views). What have I understood wrong?

zcong17huang avatar Jul 13 '21 09:07 zcong17huang

Oh, the statement refers to the fine-tuning stage; the fine-tuning pairing file is 'configs/pairs.th', thanks~
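(A quick way to inspect that pairing file; the key names in the comments below are assumptions, so print the keys to see the actual per-scene splits stored in the file.)

```python
# Inspect the fine-tuning pairing file mentioned above.
import torch

pairs = torch.load('configs/pairs.th')
print(pairs.keys())             # assumed: per-scene '*_train' / '*_val' index lists
# print(pairs['horns_train'])   # hypothetical key: the training-view indices for one scene
```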

apchenstu avatar Jul 13 '21 09:07 apchenstu