Which version of Torch? With 2.3 and 2.0.1 I get no convergence at all
Hi,
This work looks extremely interesting for a small hobby 3D-scanner project I'm working on, but I can't get it to work. From the paper and the project site, the results look more promising than those of any other AI multi-view reconstruction research I've seen, and I believe I've tried almost all open-source variants by now. So I'm hoping someone can give some hints to get it working.
When I run the bakedsdf-colmap configuration on the garden dataset, using either CUDA 12.1 with PyTorch 2.3.0 or CUDA 11.8 with PyTorch 2.1.0, the system loads and trains, but never converges: the images in the 'save' folder look the same at iteration 5,000 and at iteration 170,000 (see attachments below).
This is on Ubuntu 20.04 (a dual-boot laptop, so not via WSL) with an RTX 3080 16 GB.
I do see these warnings after starting:
| Name | Type | Params
----------------------------------------
0 | model | BakedSDFModel | 41.9 M
----------------------------------------
41.9 M Trainable params
0 Non-trainable params
41.9 M Total params
83.865 Total estimated model params size (MB)
Epoch 0: : 499it [02:51, 2.90it/s, loss=0.687, train/inv_s=20.10, train/num_rays=1664.0]
/home/lex/anaconda3/envs/torch-bakedsdf/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:149: UserWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose
  warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
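If I understand that warning correctly, it fires whenever a scheduler is stepped with an explicit epoch argument. A minimal sketch of the pattern that reproduces it (a stand-in SGD + StepLR setup, not the repo's actual optimizer configuration):

```python
import torch

# Stand-in model/optimizer/scheduler, just to reproduce the warning.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for epoch in range(3):
    optimizer.step()
    # Deprecated: passing an explicit epoch triggers the UserWarning above.
    scheduler.step(epoch)
    # The recommended chainable form would be: scheduler.step()
```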
Epoch 0: : 1001it [03:22, 4.95it/s, loss=0.683, train/inv_s=20.10, train/num_rays=1658.0]
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
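I haven't touched the DDP setup myself. If disabling the flag is worth trying, I assume it would look roughly like this when constructing the Trainer (a sketch assuming a pytorch_lightning version with DDPStrategy, i.e. >= 1.6; I don't know where the repo actually sets this):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Sketch: skip the extra autograd-graph traversal the warning mentions.
# Only safe if the model really never has unused parameters in forward.
trainer = pl.Trainer(
    devices=1,
    accelerator="gpu",
    strategy=DDPStrategy(find_unused_parameters=False),
)
```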
Epoch 0: : 5002it [08:42, 9.57it/s, loss=0.617, train/inv_s=20.10, train/num_rays=1661.0]
/home/lex/anaconda3/envs/torch-bakedsdf/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:539: PossibleUserWarning: It is recommended to use `self.log('val/psnr', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
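That last warning presumably points at the validation logging. I assume the suggested fix would look something like this inside the LightningModule (a hypothetical sketch, not the repo's code; I haven't modified anything):

```python
import pytorch_lightning as pl
import torch

class SketchModule(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        psnr = torch.tensor(11.8)  # placeholder metric value
        # sync_dist=True accumulates the metric across devices,
        # as the PossibleUserWarning suggests.
        self.log("val/psnr", psnr, sync_dist=True)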
During training, the loss stays at approximately 0.6 and the PSNR at 11.80. Could the non-convergence be caused by some incompatibility in the PyTorch version? Or am I using an incorrect configuration file? Which version of PyTorch did the authors use?
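For completeness, this is how I'm checking the versions in each of the two environments (a quick sanity-check snippet using standard torch APIs):

```python
import torch

# Quick environment report for the two setups I tested.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```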
Any help would be greatly appreciated.
Below are the images from the 'save' directory; the first was saved at 10,000 iterations, the second at 170,000.