RAD-NeRF
Training time is too long
Hi, I found that a training step takes 0.07 seconds for the forward pass but 15 seconds for the backward pass, so the total training time would be several days. Is there something wrong with my training code?
This is strange. What environment (e.g., GPU, CUDA, OS) are you using? How do you measure the time?
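For reference, CUDA kernels execute asynchronously, so per-phase timings taken without torch.cuda.synchronize() tend to charge almost all of the time to backward() (or the next synchronizing call), which can make a 0.07 s forward / 15 s backward split misleading. Below is a minimal sketch of synchronized timing; the timed_step helper and its arguments are placeholders for illustration, not part of RAD-NeRF:

```python
import time
import torch

def timed_step(model, optimizer, x, target):
    """Time forward and backward separately with explicit GPU synchronization."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()

    pred = model(x)
    loss = torch.nn.functional.mse_loss(pred, target)

    torch.cuda.synchronize()           # wait for forward kernels before stopping the clock
    t1 = time.perf_counter()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()           # wait for backward/optimizer kernels
    t2 = time.perf_counter()

    return t1 - t0, t2 - t1            # (forward seconds, backward + step seconds)
```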
This is the training print info:
It will take 27 hours per epoch. The environment is a V100 32G; is that normal?
No... it should take less than 10 hours to finish all epochs on a V100. Are you seeing this slow speed when training other DL models (e.g., ResNet)?
Is there a way to speed up the training process with only 1 GPU?
The current training speed doesn't leave much room for improvement, I guess. Maybe you could increase num_rays and train for fewer steps, but this may sacrifice performance. Also, you could try torch 2.0's compile.
I have CUDA version 11.7, A100 40GB, and I'm having an issue where it's taking 4-5 hours per epoch.
@yediny Could you provide the command you use? If you have enough GPU memory, you could try --preload 2 to see if the speed bottleneck is image loading.
Even with preload applied, it is still the same speed.
Is there any way to make better use of GPU memory for training?
And in the normal case, does it take a day to train one model?
How do I use torch.compile to accelerate training?
It seems we would have to rewrite the nn.Module as a LightningModule and then use compile?
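For what it's worth, torch.compile (torch >= 2.0) works directly on a plain nn.Module, so no LightningModule rewrite is needed. Below is a minimal sketch; TinyMLP is just a stand-in model, not the RAD-NeRF network, and note that custom CUDA extensions (such as the ray-marching ops RAD-NeRF relies on) may cause graph breaks and limit the speedup:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the actual model being trained.
class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 3))

    def forward(self, x):
        return self.net(x)

model = TinyMLP().cuda()
model = torch.compile(model)   # requires torch >= 2.0; first call triggers compilation

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4096, 64, device="cuda")
target = torch.rand(4096, 3, device="cuda")

for step in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
```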