Compatibility between abs and distributed training
Thank you very much for open-sourcing such an excellent repo. I am currently trying to tackle large-scale scene reconstruction problems using distributed training. Previously, I found the abs trick to be quite useful. My question is: why are distributed training and abs incompatible? Is this due to theoretical reasons or technical implementation difficulties? @liruilong940607
https://github.com/nerfstudio-project/gsplat/blob/0b4dddf04cb687367602c01196913cde6a743d70/gsplat/rendering.py#L536
In `all_to_all_tensor_list`:
https://github.com/nerfstudio-project/gsplat/blob/0b4dddf04cb687367602c01196913cde6a743d70/gsplat/distributed.py#L243-L245
`means2d.grad` is available after `loss.backward()` because `distF.all_to_all` supports gradient propagation through the all_to_all. `means2d.absgrad`, however, is not propagated this way, since it is written outside of autograd. We can manually all_to_all `means2d.absgrad` after `loss.backward()`; see the sketch below.
The source code of `_AlltoAll`: https://github.com/pytorch/pytorch/blob/28796f71d04302029290f473a286efc2aba339c2/torch/distributed/nn/functional.py#L374
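For illustration, a minimal sketch of what such a manual exchange could look like. This is not gsplat's API: the function name `all_to_all_absgrad` and the `splits_send`/`splits_recv` arguments are hypothetical, and the split sizes are assumed to mirror (in reverse) the per-rank row counts used by `all_to_all_tensor_list` in the forward exchange.

```python
import torch
import torch.distributed as dist


def all_to_all_absgrad(
    absgrad: torch.Tensor, splits_send: list, splits_recv: list
) -> torch.Tensor:
    """Exchange rows of means2d.absgrad across ranks after loss.backward().

    A plain (non-differentiable) all_to_all is enough here, since absgrad is a
    buffer filled in the rasterization backward, outside of autograd.
    splits_send / splits_recv are hypothetical per-rank row counts that must
    match the forward all_to_all exchange, reversed, so each row goes back to
    the rank that owns the corresponding Gaussian.
    """
    send_list = list(torch.split(absgrad, splits_send, dim=0))
    recv_list = [
        torch.empty((n, *absgrad.shape[1:]), dtype=absgrad.dtype, device=absgrad.device)
        for n in splits_recv
    ]
    dist.all_to_all(recv_list, send_list)
    return torch.cat(recv_list, dim=0)
```

After the exchange, the returned rows would presumably still need to be accumulated onto the locally owned Gaussians in the same way the autograd path accumulates `.grad`, before using them as a densification criterion.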