
The more GPUs, the slower the training speed


Describe the bug

4 × RTX 4090 GPUs, train-num-rays-per-batch 512, eval-num-rays-per-batch 128:

ns-train bakedangelo --machine.num-gpus 4 --pipeline.model.level-init 8 --trainer.steps-per-eval-image 5000  --pipeline.datamanager.train-num-rays-per-batch 512 --pipeline.datamanager.eval-num-rays-per-batch 128 --pipeline.model.sdf-field.use-appearance-embedding True --pipeline.model.background-color white --pipeline.model.sdf-field.bias 0.1 --pipeline.model.sdf-field.inside-outside False --pipeline.model.background-model grid  --pipeline.model.sdf-field.log2-hashmap-size 21  nerfstudio-data --data nerfstudio/tangtou 

[screenshot: Train Iter time with 4 GPUs]

1 × RTX 4090 GPU, train-num-rays-per-batch 2048, eval-num-rays-per-batch 512:

ns-train bakedangelo --machine.num-gpus 1 --pipeline.model.level-init 8 --trainer.steps-per-eval-image 5000  --pipeline.datamanager.train-num-rays-per-batch 2048 --pipeline.datamanager.eval-num-rays-per-batch 512 --pipeline.model.sdf-field.use-appearance-embedding True --pipeline.model.background-color white --pipeline.model.sdf-field.bias 0.1 --pipeline.model.sdf-field.inside-outside False --pipeline.model.background-model grid --pipeline.model.sdf-field.log2-hashmap-size 21   nerfstudio-data --data nerfstudio/tangtou 

[screenshot: Train Iter time with 1 GPU]
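For reference, the two runs process the same total number of rays per optimizer step, assuming train-num-rays-per-batch is applied per GPU process (4 × 512 = 1 × 2048), so the slowdown is not explained by a larger per-step workload. A minimal sketch of that arithmetic:

```python
# Total rays per optimizer step for the two runs (a sketch, assuming
# train-num-rays-per-batch is a per-GPU-process setting).
configs = {
    "4 x 4090": {"num_gpus": 4, "train_rays_per_batch": 512},
    "1 x 4090": {"num_gpus": 1, "train_rays_per_batch": 2048},
}

for name, cfg in configs.items():
    total = cfg["num_gpus"] * cfg["train_rays_per_batch"]
    print(f"{name}: {total} rays per step")  # both print 2048
```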

xiemeilong avatar Jul 26 '23 03:07 xiemeilong

Hi, could you test whether it's also slow with a smaller hash grid, e.g. --pipeline.model.sdf-field.log2-hashmap-size 18 or --pipeline.model.sdf-field.log2-hashmap-size 19?

Could you also share the output of nvidia-smi topo -m of your system?

niujinshuchong avatar Jul 26 '23 20:07 niujinshuchong

Reducing log2-hashmap-size does improve performance, but the Train Iter time remains roughly proportional to the number of GPUs.

nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PXB     SYS     SYS     NODE    NODE    0-17,36-53      0               N/A
GPU1    PXB      X      SYS     SYS     NODE    NODE    0-17,36-53      0               N/A
GPU2    SYS     SYS      X      PXB     SYS     SYS     18-35,54-71     1               N/A
GPU3    SYS     SYS     PXB      X      SYS     SYS     18-35,54-71     1               N/A
NIC0    NODE    NODE    SYS     SYS      X      PIX
NIC1    NODE    NODE    SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
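The topology shows that GPU0/GPU1 and GPU2/GPU3 sit on different NUMA nodes and reach each other only via SYS (PCIe plus the QPI/UPI interconnect, no NVLink), so every gradient all-reduce crosses the inter-socket link. A minimal micro-benchmark, a sketch assuming PyTorch with the NCCL backend (the script name and tensor size below are illustrative, not from the issue), can put a number on that per-step sync cost:

```python
# allreduce_bench.py -- rough all-reduce timing across all visible GPUs.
# Launch with: torchrun --nproc_per_node=4 allreduce_bench.py
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# ~100M float32 values (~400 MB), on the order of a large hash grid's gradients.
grads = torch.randn(100_000_000, device="cuda")

for _ in range(5):  # warm-up
    dist.all_reduce(grads)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(grads)
torch.cuda.synchronize()

if rank == 0:
    ms = (time.time() - start) / iters * 1000
    print(f"all-reduce of {grads.numel() * 4 / 1e6:.0f} MB: {ms:.1f} ms per call")
dist.destroy_process_group()
```

If the per-call time here is comparable to the single-GPU Train Iter time, gradient synchronisation alone can account for the observed slowdown.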

xiemeilong avatar Jul 27 '23 01:07 xiemeilong

Hi, the synchronisation of gradients between GPUs is the bottleneck when there are a lot of learnable parameters (log2-hashmap-size >=21).
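As a rough illustration of why, the gradient volume that must be all-reduced every step grows quickly with log2-hashmap-size. A back-of-envelope sketch for an Instant-NGP-style multi-resolution hash grid (the level count, features per level and growth factor below are assumptions and may differ from sdfstudio's actual defaults):

```python
# Back-of-envelope parameter/gradient volume for a multi-resolution hash grid
# (a sketch; num_levels, features_per_level and growth are assumed values).
def hash_grid_params(log2_hashmap_size: int,
                     num_levels: int = 16,
                     features_per_level: int = 2,
                     base_res: int = 16,
                     growth: float = 1.45) -> int:
    total = 0
    for level in range(num_levels):
        res = int(base_res * growth ** level)
        # Each level stores at most 2**log2_hashmap_size hash-table entries.
        entries = min(res ** 3, 2 ** log2_hashmap_size)
        total += entries * features_per_level
    return total

for size in (18, 19, 21):
    params = hash_grid_params(size)
    print(f"log2-hashmap-size={size}: ~{params / 1e6:.0f}M params, "
          f"~{params * 4 / 1e6:.0f} MB of float32 gradients per all-reduce")
```

Under these assumptions, log2-hashmap-size 21 puts tens of millions of parameters (hundreds of MB of float32 gradients) onto the PCIe/QPI path every iteration, roughly an order of magnitude more than at 18 or 19.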

niujinshuchong avatar Jul 27 '23 09:07 niujinshuchong

@niujinshuchong Even with log2-hashmap-size reduced to 4, multiple GPUs are still slower than a single GPU.

1 × RTX 4090, log2-hashmap-size 4, train-num-rays-per-batch 2048, eval-num-rays-per-batch 512:

[screenshot: Train Iter time]

4 × RTX 4090, log2-hashmap-size 4, train-num-rays-per-batch 2048, eval-num-rays-per-batch 512:

[screenshot: Train Iter time]

xiemeilong avatar Jul 28 '23 05:07 xiemeilong