Chen-Hsuan Lin
Hi @Ziba-li, the multi-GPU setup (i.e. distributed training) enables training with larger batch sizes. It doesn't increase the per-iteration training speed, but it will be much faster to train each...
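For context, a minimal sketch of a multi-GPU launch (the trailing `...` stands in for your usual arguments; adjust the process count to however many GPUs you have):

```sh
# Single-node distributed launch on 4 GPUs. Per-iteration speed stays
# roughly the same as on one GPU, but the effective batch size (and thus
# the progress made per iteration) scales with the number of processes.
torchrun --nproc_per_node=4 train.py ...
```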
I'm not sure about the communication overhead of the 4090, but we didn't see such an issue with A100s. If you could help pinpoint where the additional overhead is coming from (and...
In our toy example and the Colab, we use the test set of the Lego sequence instead of the training set. This is to simulate a smooth camera trajectory that...
Hi @xiemeilong, in addition to the above @mli0603 mentioned, we also have a fix (#41) on the scripts. If you were extracting the mesh from an earlier checkpoint, please pull...
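As a quick sketch (assuming the default remote and branch names), updating the checkout before retrying would look like:

```sh
# Pull the latest scripts so the fix from #41 is included,
# then retry the mesh extraction step.
git pull origin main
```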
@xiemeilong @zz7379 we have pushed an update to `main` yesterday that fixed a checkpoint issue which may be related. Could you pull and try running the pipeline again? Please let...
@yuxuJava789 if you are training with the default config, this is expected at 20k iterations. You would need to run to 500k iterations to get the final results. If you...
@derrick-xwp your results look fine. Could you elaborate on what the concern is?
Hi @ZirongChan, could you post the full error log? Thanks!
This seems to be an issue on the W&B side. We don't support TensorBoard right now, but PRs are welcome if you'd like to help add this support.
To disable distributed training, you can run `python train.py --single_gpu ...` instead of `torchrun --nproc_per_node=1 train.py ...` and it should work.
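For clarity, the two invocations side by side (the trailing `...` stands in for your usual arguments):

```sh
# Single-GPU run without the distributed launcher:
python train.py --single_gpu ...

# Equivalent single-process run through torchrun, which still goes
# through torch.distributed initialization:
torchrun --nproc_per_node=1 train.py ...
```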