
Multiple GPU training is slow.

Open · Ziba-li opened this issue 2 years ago · 13 comments

Dear author, thanks for looking at this question. When I trained toy_example on a server with eight 4090 GPUs, I found that training was not much faster than with a single card. Single-card training takes more than 100 hours to complete 500,000 epochs, so I expected eight-card distributed training to finish in a little over ten hours, but the actual speed is not much faster. What is the reason for this? [screenshots]

Ziba-li avatar Aug 22 '23 03:08 Ziba-li

I only need 24 hours to train 500,000 epochs with a single 4090 GPU. However, when I use 4x 4090 GPUs, the training speed is actually only a quarter of the single-GPU speed. I have the same issue with unofficial implementations. I guess the model is so big that the communication overhead of syncing gradients between GPUs is too large.
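
To test that hypothesis, here is a minimal sketch (placeholder `ddp_model`, `batch`, and `loss_fn` names, not Neuralangelo code) that compares per-iteration time with and without DDP's gradient all-reduce:

```python
# Sketch only: `ddp_model` must be a torch.nn.parallel.DistributedDataParallel
# module; `batch` and `loss_fn` stand in for the real training step.
import time
import torch

def avg_step_time(ddp_model, batch, loss_fn, n_iters=50, sync_grads=True):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        ddp_model.zero_grad(set_to_none=True)
        if sync_grads:
            loss_fn(ddp_model(batch)).backward()
        else:
            with ddp_model.no_sync():  # skips the gradient all-reduce
                loss_fn(ddp_model(batch)).backward()
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters

# t_sync - t_nosync approximates the per-iteration communication cost:
# t_sync = avg_step_time(ddp_model, batch, loss_fn, sync_grads=True)
# t_nosync = avg_step_time(ddp_model, batch, loss_fn, sync_grads=False)
```

If the gap between the two timings is large, gradient synchronization really is the bottleneck.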

xiemeilong avatar Aug 22 '23 05:08 xiemeilong

> I only need 24 hours to train 500,000 epochs with a single 4090 GPU. However, when I use 4x 4090 GPUs, the training speed is actually only a quarter of the single-GPU speed. I have the same issue with unofficial implementations. I guess the model is so big that the communication overhead of syncing gradients between GPUs is too large.

Did you run the following script to get the data?

```bash
EXPERIMENT_NAME=toy_example
PATH_TO_VIDEO=toy_example.MOV
SKIP_FRAME_RATE=24
SCENE_TYPE=object  # {outdoor,indoor,object}
bash projects/neuralangelo/scripts/preprocess.sh ${EXPERIMENT_NAME} ${PATH_TO_VIDEO} ${SKIP_FRAME_RATE} ${SCENE_TYPE}
```

I ran it following that process, but I don't understand why one epoch takes 2.3 seconds on a 4090 GPU. At that rate, 500,000 epochs will take 319 hours.

Ziba-li avatar Aug 22 '23 06:08 Ziba-li

I used my own data, just over 500 images.

xiemeilong avatar Aug 22 '23 07:08 xiemeilong

> I used my own data, just over 500 images.

The toy data I used actually has only 29 valid images. I don't understand why it would take 319 hours to complete 500,000 epochs.

Ziba-li avatar Aug 22 '23 07:08 Ziba-li

+1. How can I make training faster? It's impractical to run this on multiple videos when each one takes so long. I'm looking for faster training for each video.

smandava98 avatar Aug 22 '23 07:08 smandava98

Hi @Ziba-li, the multi-GPU setup (i.e. distributed training) enables training with larger batch sizes. It doesn't increase the per-iteration training speed, but each epoch trains much faster. If you want to look into training with fewer iterations, this may be helpful.
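
For reference, a multi-GPU launch follows the same pattern as the README's training command; the values below are illustrative, and the `--max_iter=125000` override assumes the config exposes `max_iter` and that 4 GPUs let you cut the iteration count by roughly 4x:

```bash
EXPERIMENT=toy_example
GROUP=example_group
NAME=example_name
CONFIG=projects/neuralangelo/configs/custom/${EXPERIMENT}.yaml
GPUS=4  # DDP grows the effective batch size; per-iteration time stays roughly flat
torchrun --nproc_per_node=${GPUS} train.py \
    --logdir=logs/${GROUP}/${NAME} \
    --config=${CONFIG} \
    --max_iter=125000 \
    --show_pbar
```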

chenhsuanlin avatar Aug 22 '23 08:08 chenhsuanlin

> Hi @Ziba-li, the multi-GPU setup (i.e. distributed training) enables training with larger batch sizes. It doesn't increase the per-iteration training speed, but each epoch trains much faster. If you want to look into training with fewer iterations, this may be helpful.

Thank you for your reply, but I still don't understand why a single GPU takes such a long time to run 500,000 epochs, rather than the roughly 16 hours on an A100 24G needed to get the results.

Ziba-li avatar Aug 22 '23 10:08 Ziba-li

I have the same problem. When I run the lego demo on a single 3090 GPU, it takes ~9 s per epoch. [screenshot] However, when I train with 4x 3090s, it takes ~23 s, which is very strange. [screenshot]

otakudj avatar Aug 23 '23 05:08 otakudj

I just tried an A100-PCIE-40GB on the AutoDL platform. With a single GPU, the training time for lego.mp4 is as follows: [screenshot]

For the hyperparameters, I only changed dict_size from 22 to 21.
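
The same change should also be possible as a command-line config override instead of editing the YAML; the key path below is my reading of the config layout, so treat it as an assumption. Since `dict_size` is a log2 value, going from 22 to 21 halves the hash-grid table:

```bash
torchrun --nproc_per_node=1 train.py \
    --config=projects/neuralangelo/configs/custom/lego.yaml \
    --logdir=logs/example_group/lego \
    --model.object.sdf.encoding.hashgrid.dict_size=21  # 2^21 instead of 2^22 entries
```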

prettybot avatar Aug 25 '23 16:08 prettybot

I'm not sure about the communication overhead of the 4090, but we didn't see such an issue with A100s. If you could help pinpoint where the additional overhead is coming from (and verify that it is indeed coming from the gradient synchronization), I can put up a note on that.
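
One way to pinpoint it is to profile a single training step and check how much CUDA time goes to NCCL kernels. A minimal sketch, with placeholder names (`ddp_model`, `batch`, `loss_fn`) for the actual training step:

```python
# Sketch only: profile one step and compare NCCL all-reduce time vs. compute.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = loss_fn(ddp_model(batch))  # stand-ins for the real training step
    loss.backward()
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
# If rows like "ncclKernel_AllReduce..." dominate cuda_time_total, the extra
# multi-GPU overhead is coming from gradient synchronization.
```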

Also a minor note that we are measuring by iterations (500k) instead of epochs in the codebase.

chenhsuanlin avatar Aug 26 '23 05:08 chenhsuanlin

@chenhsuanlin thanks a lot for your explanation about the 500k part.

prettybot avatar Aug 26 '23 05:08 prettybot

> I just tried an A100-PCIE-40GB on the AutoDL platform. With a single GPU, the training time for lego.mp4 is as follows: [screenshot]
>
> For the hyperparameters, I only changed dict_size from 22 to 21.

Hello, I trained the lego demo with 2x A100-PCIE-80GB, but I still get bad timing results, as below: each epoch takes 7 s. [screenshot]

uu5208 avatar Oct 28 '23 13:10 uu5208