Open-Llama

training speed questions: multi-node training, CPU offload

Open jli opened this issue 1 year ago • 2 comments

First of all, thank you for this project! I had some questions about how training was done, as I've struggled to scale training to larger model sizes when using transformers + deepspeed on the original llama weights.

  1. I'm curious if you've tested multi-node distributed training? From what I understand, all the training speed numbers listed in your 2023-05-08 v2.1 release are from training on a single node with 8x A100 80GB. Is that correct?

  2. Could you share other aspects of the machine configuration (eg, number of CPUs and amount of CPU memory)? In particular, when using CPU offloading for 65B, how much CPU memory was needed?

  3. From the v2.1 release notes, it looks like you have very close to linear slowdown when increasing the model size. It's surprising to me that, even as you have to move from DS-1 to DS-3 (for 13, 33, 65B) and enable CPU offloading (for 65B), you don't get more of a slowdown due to additional overhead. Can you comment on how you were able to achieve this?

Thank you!

jli · May 15 '23, 23:05

  1. The current test results are from multi-node parallel training on 48 GPUs. The single-machine setup I previously wrote in the README was actually incorrect, sorry.
  2. The hardware I used is (8x A100-80G, 64 CPU cores, 1 TB memory) x 6 nodes, with the GPUs connected via NVLink. With this configuration, CPU memory usage averages about 450 GB.
  3. I was also very surprised when I first saw this. Later I compared it against the original LLaMA paper and confirmed that it is roughly in line with expectations. There are two points worth noting. First, I used a gradient accumulation of 12 in all cases, which greatly reduces the communication cost. Second, I use gradient checkpointing for the 13B and larger models, which, similar to gradient accumulation, reduces communication overhead. Finally, I tested the 7B model with several configurations on a single machine, and the actual training speeds were similar. My guess is that when using stage3+offload, although there is more CPU-GPU communication, the larger batch size reduces the number of communications between different GPUs. [speed comparison screenshot] You can reproduce this test result using utils/speed_test/accelerate.
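
For concreteness, here is a minimal sketch of the kind of configuration described above (ZeRO stage 3 with CPU offload and gradient accumulation of 12). Only the accumulation value of 12 comes from this thread; the micro-batch size, precision, and file name are placeholder assumptions rather than the repo's actual settings:

```python
import json

# Sketch of a DeepSpeed ZeRO-3 + CPU offload config.
# gradient_accumulation_steps=12 is the value mentioned above;
# everything else is an illustrative assumption.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,   # assumed; tune for your GPUs
    "gradient_accumulation_steps": 12,     # value stated in the reply
    "bf16": {"enabled": True},             # assumed precision
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

with open("ds_zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Gradient checkpointing on a Hugging Face transformers model is enabled
# separately, e.g. model.gradient_checkpointing_enable().
```

The resulting JSON can then be handed to whatever launcher you use (e.g. a `--deepspeed` argument or an accelerate config file); for ZeRO stage 1/2 runs you would only change `stage` and drop the offload entries.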

s-JoL · May 16 '23, 06:05

Thank you for sharing those details 🙏

I'm not able to get reliable access to A100s, so I've been trying to do distributed llama fine-tuning on A40s. One thing I'm concerned about is that A40s do not have NVLink. Do you think this will cause a massive slowdown?
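
As an aside, a quick way to confirm what interconnect a node actually has is to print the GPU topology matrix. A small sketch that just wraps the standard `nvidia-smi topo -m` command:

```python
import subprocess

# Print the GPU interconnect matrix. Entries like NV1/NV2 indicate NVLink
# between a GPU pair; PIX/PXB/PHB/SYS indicate PCIe (and possibly
# cross-socket) paths, which are slower for all-reduce traffic.
result = subprocess.run(["nvidia-smi", "topo", "-m"],
                        capture_output=True, text=True)
print(result.stdout)
```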

I don't have a good understanding of how DeepSpeed decides which GPU does which work. I assume it tries to arrange the parallelism topology intelligently so that more communication happens between GPUs on the same node rather than across the network. Do you know how to debug/inspect/visualize which parts of the model are being sent to which GPUs?
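
One rough way to inspect this, as a sketch: print, on every rank, how much GPU memory is allocated and how many parameter elements are materialized locally. The `ds_tensor` attribute is a ZeRO-3 partitioning detail and is treated here as an assumption, with a plain-PyTorch fallback:

```python
import torch
import torch.distributed as dist

def report_rank_placement(model):
    """Rough per-rank report of how much of the model this rank holds."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    local_elems = 0
    for p in model.parameters():
        # Under ZeRO-3, DeepSpeed keeps only a local shard of each parameter
        # (assumed here to be exposed as `ds_tensor`); otherwise count the
        # full parameter, as in plain data parallelism.
        shard = getattr(p, "ds_tensor", None)
        local_elems += shard.numel() if shard is not None else p.numel()
    alloc_gib = torch.cuda.memory_allocated() / 2**30
    print(f"rank {rank}: ~{local_elems / 1e9:.2f}B parameter elements local, "
          f"{alloc_gib:.1f} GiB allocated on GPU {torch.cuda.current_device()}")
```

Calling this on every rank after the model is wrapped should at least show whether the shards are spread evenly; it won't show the communication pattern itself, which is largely determined by NCCL's topology detection.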

jli · May 16 '23, 12:05