The more GPUs, the slower the training speed
Describe the bug
4x 4090 GPUs, train-num-rays-per-batch 512, eval-num-rays-per-batch 128
ns-train bakedangelo --machine.num-gpus 4 --pipeline.model.level-init 8 --trainer.steps-per-eval-image 5000 --pipeline.datamanager.train-num-rays-per-batch 512 --pipeline.datamanager.eval-num-rays-per-batch 128 --pipeline.model.sdf-field.use-appearance-embedding True --pipeline.model.background-color white --pipeline.model.sdf-field.bias 0.1 --pipeline.model.sdf-field.inside-outside False --pipeline.model.background-model grid --pipeline.model.sdf-field.log2-hashmap-size 21 nerfstudio-data --data nerfstudio/tangtou
1x 4090 GPU, train-num-rays-per-batch 2048, eval-num-rays-per-batch 512
ns-train bakedangelo --machine.num-gpus 1 --pipeline.model.level-init 8 --trainer.steps-per-eval-image 5000 --pipeline.datamanager.train-num-rays-per-batch 2048 --pipeline.datamanager.eval-num-rays-per-batch 512 --pipeline.model.sdf-field.use-appearance-embedding True --pipeline.model.background-color white --pipeline.model.sdf-field.bias 0.1 --pipeline.model.sdf-field.inside-outside False --pipeline.model.background-model grid --pipeline.model.sdf-field.log2-hashmap-size 21 nerfstudio-data --data nerfstudio/tangtou
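A note on the comparison, assuming train-num-rays-per-batch is a per-GPU value under DDP (an assumption on my part, not something confirmed in this thread): both runs then push the same total number of rays per optimizer step, so any extra time on 4 GPUs is pure multi-GPU overhead. Quick sketch:

# Assumes the ray batch is applied per process under DDP (hypothetical, not verified here).
rays_4gpu = 4 * 512     # 4x 4090, 512 rays each   -> 2048 rays per step
rays_1gpu = 1 * 2048    # 1x 4090, 2048 rays       -> 2048 rays per step
assert rays_4gpu == rays_1gpu  # equal work per step; any slowdown must come from overhead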
Hi, could you test whether it is also slow with a smaller hash grid, e.g. --pipeline.model.sdf-field.log2-hashmap-size 18 or --pipeline.model.sdf-field.log2-hashmap-size 19?
Could you also share the output of nvidia-smi topo -m on your system?
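For context on why the hash-map size matters here, a rough, assumption-laden estimate of the hash-grid parameter count that DDP has to all-reduce every step (num_levels and features_per_level are typical tiny-cuda-nn defaults, not values taken from this thread, and coarse levels are really capped at resolution^3, so this is an upper bound):

def hash_grid_params(log2_hashmap_size, num_levels=16, features_per_level=2):
    # Upper bound: assume every level uses the full hash table.
    return num_levels * (2 ** log2_hashmap_size) * features_per_level

print(hash_grid_params(21))  # ~67M params  -> ~268 MB of fp32 gradients per sync
print(hash_grid_params(18))  # ~8.4M params -> ~34 MB of fp32 gradients per sync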
Reducing the log2-hashmap-size does improve performance, but the Train Iter time still grows roughly in proportion to the number of GPUs.
nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PXB     SYS     SYS     NODE    NODE    0-17,36-53      0               N/A
GPU1    PXB      X      SYS     SYS     NODE    NODE    0-17,36-53      0               N/A
GPU2    SYS     SYS      X      PXB     SYS     SYS     18-35,54-71     1               N/A
GPU3    SYS     SYS     PXB      X      SYS     SYS     18-35,54-71     1               N/A
NIC0    NODE    NODE    SYS     SYS      X      PIX
NIC1    NODE    NODE    SYS     SYS     PIX      X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
Hi, the synchronisation of gradients between GPUs is the bottleneck when there are a lot of learnable parameters (log2-hashmap-size >=21).
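To see how expensive that synchronisation actually is on this PCIe/SYS topology, here is a minimal stand-alone sketch (not part of sdfstudio; the payload size reuses the rough estimate above) that times the NCCL all-reduce DDP would issue on the hash-grid gradients. Launch with torchrun --nproc_per_node=4 allreduce_bench.py:

import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# ~67M fp32 values, roughly the gradient payload of a log2-hashmap-size=21 grid.
grad = torch.randn(16 * (2 ** 21) * 2, device="cuda")

for _ in range(5):               # warm-up
    dist.all_reduce(grad)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(grad)
torch.cuda.synchronize()

if dist.get_rank() == 0:
    mb = grad.numel() * 4 / 1e6
    print(f"all_reduce of {mb:.0f} MB took {(time.time() - start) / iters * 1e3:.1f} ms per call")
dist.destroy_process_group()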
@niujinshuchong Even with the log2-hashmap-size reduced to 4, training on multiple GPUs is still slower than on a single GPU.
1x 4090, log2-hashmap-size=4, train-num-rays-per-batch 2048, eval-num-rays-per-batch 512
4x 4090, log2-hashmap-size=4, train-num-rays-per-batch 2048, eval-num-rays-per-batch 512
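If it helps to narrow this down, the benchmark above can be rerun with a payload matching log2-hashmap-size=4 (only a few hundred values); if the per-iteration cost stays high even then, fixed per-step latency across the SYS/PXB hops, rather than gradient bandwidth, would be the likelier culprit. Hypothetical one-line change:

grad = torch.randn(16 * (2 ** 4) * 2, device="cuda")  # ~512 values, negligible bandwidth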