ACORN
Training speed
Hi, I'm training a 3D model (an engine) with your code, and I completely followed the steps in the README. But the code runs too slowly; it would take more than 1000 hours to finish. Where is the problem? (I used a GeForce RTX 2080 Ti.)
The maximum number of iterations in the training script is probably much more than is necessary for the model to converge. How many iterations are you running it for? Does the loss decrease and begin to converge?
The training script will save out model checkpoints at intermediate points, and you can try saving out meshes from these models to see how they look.
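As a rough illustration, exporting a mesh from a saved checkpoint might look like the sketch below, which assumes a standard PyTorch checkpoint and uses marching cubes on a sampled occupancy grid; the model class, checkpoint layout, and level-set value here are placeholders rather than the repo's actual names:

```python
import torch
import skimage.measure
import trimesh

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Placeholders: substitute the repo's actual model class and checkpoint layout.
model = OccupancyModel().to(device)
ckpt = torch.load('checkpoints/model_step_20000.pth', map_location=device)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Query occupancy on a dense grid over a normalized [-1, 1]^3 volume.
res = 256
lin = torch.linspace(-1.0, 1.0, res)
coords = torch.stack(torch.meshgrid(lin, lin, lin, indexing='ij'), dim=-1).reshape(-1, 3)

values = []
with torch.no_grad():
    for chunk in torch.split(coords.to(device), 65536):
        values.append(model(chunk).squeeze(-1).cpu())
volume = torch.cat(values).reshape(res, res, res).numpy()

# Extract the level set (the right level depends on how occupancy is
# parameterized) and export a mesh that MeshLab can open.
verts, faces, _, _ = skimage.measure.marching_cubes(volume, level=0.0)
trimesh.Trimesh(vertices=verts, faces=faces).export('checkpoint_mesh.ply')
```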
I haven't been able to replicate this issue, so closing for now. Please follow up if this is still a problem.
I have the same issue training the 3D models. Currently, I use the default number of epochs in the Thai Statue config file, which is 10,000. Training on a V100 for 24 hours only completed 60,000 iterations, which is not many epochs. I exported the DAE mesh and it didn't look converged. The following is a snapshot of that DAE mesh in MeshLab:

Training to 60,000 iterations should yield a better result than what you're showing above, so something seems off. We optimize to 48k iters in the paper for the Thai Statue and it looks much more detailed than the above.
Also, it seems strange that it takes 24 hours to get to 60k iterations. I can run around 10 it/s on my laptop GPU (GTX 1650), and at this rate it should only take a couple hours to reach 60k. I guess a V100 should be even faster.
Are you sure you are using the default config file without any changes from the repo? How many workers are you using for the dataloader? Can you also post the tensorboard summaries for the occupancy loss?
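If data loading is a suspect, a generic way to measure dataloader throughput (not from this repo; train_dataset stands in for whatever dataset the training script constructs) is:

```python
import time
from torch.utils.data import DataLoader

def batches_per_second(dataset, num_workers, n_batches=50):
    loader = DataLoader(dataset, batch_size=1, num_workers=num_workers)
    it = iter(loader)
    next(it)  # warm-up: lets worker processes spin up before timing
    t0 = time.time()
    for _ in range(n_batches):
        next(it)
    return n_batches / (time.time() - t0)

# Example: compare worker counts to see where throughput saturates.
# for w in (0, 2, 4, 8):
#     print(w, batches_per_second(train_dataset, w))
```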
I'm sure that I used the config cloned from the repo. The following are the loss curves for the Thai Statue and the config I used:
I remember the speed was around 3-4 it/s or even lower. I initially suspected that the GPU was not utilized during training, but I checked that 'torch.cuda.is_available()' returns True. It's weird to see this behavior. Thanks in advance for the help.
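As a side note, 'torch.cuda.is_available()' only confirms that a GPU is visible to PyTorch; it does not guarantee the model and batches were actually moved onto it. A fuller check might look like this generic sketch, where model stands in for the network the training script builds:

```python
import torch

print(torch.cuda.is_available())         # True only means a GPU is visible
print(next(model.parameters()).device)   # should be cuda:0, not cpu
print(torch.cuda.get_device_name(0))     # confirm which GPU PyTorch picked

# While training, `nvidia-smi -l 1` shows whether the GPU is actually busy.
```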
PS: Does the training 3D mesh need to be watertight? Does this matter? We trained with a watertight mesh.
Hmm, unfortunately I'm still having a hard time reproducing this.
One observation is that my occupancy loss curves look very different compared to yours. It's almost as if the block optimization is not happening at all in your case. You should see the error spike a bit at intervals where the block optimization is done. The loss also doesn't go down monotonically because as the blocks subdivide, there are more blocks and hence more fitting error.
I followed these steps:
- re-downloaded the repo
- created a new conda environment using the instructions in the README.md
- downloaded the Thai Statue PLY file from the Stanford 3D Scanning repo (ideally the models should be watertight, but in practice it seems to work if the meshes are close to watertight; a quick way to check this is sketched after this list)
- re-ran the training using the config_thai_acorn.ini file
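Regarding the watertightness question above, a quick way to inspect a mesh is with trimesh (my choice for illustration, not necessarily what the repo uses internally):

```python
import trimesh

mesh = trimesh.load('data/thai_statue.ply')  # path assumes the README layout
print(mesh.is_watertight)                    # True for a closed, manifold surface
print(len(mesh.vertices), len(mesh.faces))
```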
I get the result below after 20K iterations (I exported the mesh and visualized it in MeshLab). This took an hour or so on an old Titan X GPU.
Thank you for your feedback. It seems the issue still happens on our side even though I tried to replicate the process you mentioned. Either:
- The exported mesh still looks bad in terms of detail if I use a thai_statue mesh from other sources, or
- There are errors during training (number of octants > maximum octants, problem infeasible) if I use the mesh downloaded from the Stanford webpage. I changed the max octants to 8192, but it still doesn't help.
I will dig a little bit more into this and will update if I find the issue. Thank you for your help again!
Hmm this is strange, and it's hard to diagnose since I can't reproduce it. A few other thoughts:
- Maybe there is some difference in the hardware or Python packages?
- Did you rename the downloaded model file to thai_statue.ply in the data directory?
- Does it work if you try on a different machine?