Compatibility between abs and distributed training
Thank you very much for open-sourcing such an excellent repo. I am currently trying to tackle large-scale scene reconstruction problems using distributed training. Previously, I found the abs trick to be quite useful. My question is: why are distributed training and abs incompatible? Is this due to theoretical reasons or technical implementation difficulties? @liruilong940607
https://github.com/nerfstudio-project/gsplat/blob/0b4dddf04cb687367602c01196913cde6a743d70/gsplat/rendering.py#L536
In `all_to_all_tensor_list`:
https://github.com/nerfstudio-project/gsplat/blob/0b4dddf04cb687367602c01196913cde6a743d70/gsplat/distributed.py#L243-L245
`means2d.grad` is available after `loss.backward()` because `distF.all_to_all` supports gradient propagation through the all_to_all. `means2d.absgrad`, however, is not propagated this way, since it is written outside of autograd. We can manually all_to_all `means2d.absgrad` after `loss.backward()`; see the sketch below.
The source code of `_AlltoAll`: https://github.com/pytorch/pytorch/blob/28796f71d04302029290f473a286efc2aba339c2/torch/distributed/nn/functional.py#L374
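For illustration, a minimal sketch of what such a manual exchange could look like. This is not gsplat's API: the function name `all_to_all_absgrad` and the `splits_send`/`splits_recv` arguments are hypothetical, and the split sizes are assumed to mirror (in reverse) the per-rank row counts used by `all_to_all_tensor_list` in the forward exchange.

```python
import torch
import torch.distributed as dist


def all_to_all_absgrad(
    absgrad: torch.Tensor, splits_send: list, splits_recv: list
) -> torch.Tensor:
    """Exchange rows of means2d.absgrad across ranks after loss.backward().

    A plain (non-differentiable) all_to_all is enough here, since absgrad is a
    buffer filled in the rasterization backward, outside of autograd.
    splits_send / splits_recv are hypothetical per-rank row counts that must
    match the forward all_to_all exchange, reversed, so each row goes back to
    the rank that owns the corresponding Gaussian.
    """
    send_list = list(torch.split(absgrad, splits_send, dim=0))
    recv_list = [
        torch.empty((n, *absgrad.shape[1:]), dtype=absgrad.dtype, device=absgrad.device)
        for n in splits_recv
    ]
    dist.all_to_all(recv_list, send_list)
    return torch.cat(recv_list, dim=0)
```

After the exchange, the returned rows would presumably still need to be accumulated onto the locally owned Gaussians in the same way the autograd path accumulates `.grad`, before using them as a densification criterion.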