
Unstable training results

Open Fangkang515 opened this issue 3 years ago • 11 comments

Thanks for your astonishing work. On my own datasets, I trained many times with exactly the same parameters, but the PSNR was different each time. For example, three runs gave PSNRs of 23.81, 25.17, and 21.43. Is any randomness introduced in the code? How can I avoid this?

Fangkang515 avatar Jan 20 '22 15:01 Fangkang515

Hi there, training should be approximately deterministic since all the random numbers are seeded.

I am saying "approximately", because floating point addition isn't associative and the order of gradient accumulation depends on thread scheduling, so there can be slight numerical differences. These appear to be somewhat more significant in the NeRF setting than in others.
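A minimal illustration of the non-associativity (this is not instant-ngp code, just a demonstration of the underlying floating-point behavior): summing the same values in a different order changes the low-order bits, which is exactly what happens when GPU threads accumulate gradients in a scheduling-dependent order.

```python
# IEEE-754 addition is commutative but NOT associative: grouping the same
# three terms differently yields different rounded results.
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
print(a == b)           # False

# Atomic gradient accumulation on a GPU behaves like the left sum with an
# unpredictable, run-to-run ordering, so trained weights can drift slightly
# even with identical seeds.
```

Over thousands of iterations this tiny drift can compound, which may explain why a visually identical setup still diverges to different PSNRs.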

I'll double-check the codebase for other potential sources of non-determinism -- it has been a while since I verified this last time.

In the interim: it would be helpful to see how much of a visual difference the non-determinism makes in your case and to learn about your training parameters. Do you see these large differences after training for a few seconds, or after a few minutes?

Tom94 avatar Jan 20 '22 15:01 Tom94

I did train a nerf model. I can see these large differences after training for a few seconds. And even if I increase the training time, such as 20 minutes, the results are still different. The following are the three training results under the same parameters:

[image: PSNR results of the three runs]

The visual difference is also relatively large; the three results are shown below: [image: rendered results of the three runs]

Hi, @Tom94. Do you have any solution for the above unstable training results? I checked the code but didn't find a way to avoid it.

Fangkang515 avatar Jan 21 '22 02:01 Fangkang515

I have the same problem with unstable results. I ran training three times on the same image dataset; the first time I could see a clear outline of the model, but the second and third runs produced a total mess.

StarsTesla avatar Feb 22 '22 07:02 StarsTesla

Are you testing on your own dataset? Are its results good? I don't get good results on my own dataset, and I'd like to know why.

qhdqhd avatar Feb 25 '22 13:02 qhdqhd

I've been training NeRFs with this code recently, and have made a few changes to the COLMAP extraction method, which appears to help significantly with training at the expense of both (relatively minor) added compute time and (more major) RAM usage.

slash-under avatar Feb 27 '22 21:02 slash-under

slash-under/instant-ngp@efdd42a851039b689b84b84191e0358dbd35f07d

It's about 1-1.25 GB of RAM per thread at peak usage during initial extraction for about 250 3840x2160 frames, but greatly improves NeRF model quality with large datasets in my experience.
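A quick back-of-the-envelope check of the figures above (the thread count is a hypothetical example; the per-thread number is the upper bound quoted):

```python
# Rough peak-RAM estimate for the modified COLMAP extraction, assuming
# peak usage scales linearly with the number of worker threads.
ram_per_thread_gb = 1.25   # upper bound quoted for 3840x2160 frames
threads = 16               # hypothetical worker count
peak_gb = ram_per_thread_gb * threads
print(peak_gb)             # 20.0 GB at peak for this configuration
```

So on a many-core machine the extraction step can easily dominate system RAM, which is the trade-off mentioned above.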

Sample:

https://user-images.githubusercontent.com/63025235/155902826-58a972c6-4891-4ef5-88c5-1026180abf41.mp4

slash-under avatar Feb 27 '22 22:02 slash-under

That's very good to know, thank you very much for pointing it out! (Also love the video!)

Tom94 avatar Feb 28 '22 19:02 Tom94

I see that a commit was pushed to reduce memory usage -- hopefully the data for my next model will fit now. Is there any way to utilize more than one GPU if the dataset doesn't fit into VRAM?

Edit: looking a bit further, maybe I can use the PyTorch bindings of the underlying library to accomplish something similar.

slash-under avatar Mar 02 '22 21:03 slash-under

@Tom94 Could we please have the spam removed? I sent a report to GitHub, so now we're waiting on a response from them...

slash-under avatar Mar 30 '22 03:03 slash-under

@slash-under great video! Could you please give some hints on the parameters you used for aerial photos? I followed the tutorial on custom datasets but am still struggling to get anything clear enough.

lukszamarcin avatar Apr 24 '22 19:04 lukszamarcin

> I'll double-check the codebase for other potential sources of non-determinism -- it has been a while since I verified this last time.

Hi @Tom94, thanks for your work! Are there any updates on this? I'm training some NeRFs on custom data and measuring the cosine similarity between the weights of the different NeRF instances trained on the same images/depths. This is always ~0.75. Do you think that this can be somehow increased?

fedeceola avatar Mar 21 '24 10:03 fedeceola