
Gradient Accumulation Step under Multi-node Pretraining

SHUMKASHUN opened this issue 1 year ago • 9 comments

@awaelchli I found that in pretrain.py, the accumulation steps are calculated from the global batch size, the device count, and the micro batch size. This works fine in a single-node setting, e.g. global batch size = 1024, device count = 8, micro batch size = 16: the gradient accumulation step count is just 1024/8/16 = 8. However, it seems the script does not consider the multi-node setting. If I use two nodes to train, the gradient accumulation step count is still 8 (it still treats devices = 8). I am wondering whether I should manually change this accumulation step in the code? Thank you for any suggestions.
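
For reference, a minimal sketch of that calculation (my paraphrase; the variable names only approximate the litgpt source):

```python
# Paraphrased sketch of the accumulation arithmetic in pretrain.py
# (approximate, not the exact litgpt source):
def gradient_accumulation_iters(global_batch_size: int, devices: int,
                                micro_batch_size: int) -> int:
    batch_size = global_batch_size // devices  # per-device batch per optimizer step
    iters = batch_size // micro_batch_size
    assert iters > 0
    return iters

# `devices` is the per-node GPU count, so the node count never enters:
print(gradient_accumulation_iters(1024, devices=8, micro_batch_size=16))  # 8
```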

SHUMKASHUN avatar Jun 10 '24 00:06 SHUMKASHUN

Good question, and intuitively I'd say that's a valid point. @awaelchli, what are your thoughts here? I think you have some experience running pretraining on multiple nodes.

rasbt avatar Jun 13 '24 01:06 rasbt

The global_batch_size is global across all devices in a machine, i.e., it is per machine. We did this out of convenience so that you can first optimize your training on a single node and then scale out to multiple nodes without having to change much else. The alternative would be to make global_batch_size global across all devices in the cluster and recompute the other values based on that.

In my view, the second approach has more practical disadvantages than the first. For example, I would find it very annoying to choose a value for the global batch size that is evenly divisible by both the number of devices and the micro batch size.
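
To make the first convention concrete, here is a small worked example (my own arithmetic, reusing the numbers from the original question):

```python
# Assumed numbers from the example at the top of the thread:
global_batch_size = 1024   # per machine, under the current convention
devices_per_node = 8
micro_batch_size = 16

accumulation = global_batch_size // devices_per_node // micro_batch_size
for num_nodes in (1, 2):
    effective = global_batch_size * num_nodes  # samples per optimizer step
    print(f"{num_nodes} node(s): accumulation={accumulation}, effective batch={effective}")
# 1 node(s): accumulation=8, effective batch=1024
# 2 node(s): accumulation=8, effective batch=2048
```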

If the current name is a problem, we can also rename the variable.

awaelchli avatar Jun 14 '24 19:06 awaelchli

Thank you so much for the explanation. It would be good to add a note in the README that global_batch_size is per machine, because people may easily keep global_batch_size unchanged when extending to multiple machines.

SHUMKASHUN avatar Jun 15 '24 10:06 SHUMKASHUN

Yes I agree. We could mention it here at least: https://github.com/Lightning-AI/litgpt/blob/76c88950f8bdb59f87ad6a870409f655956e725b/litgpt/args.py#L16-L17 Would you like to do it?

awaelchli avatar Jun 17 '24 04:06 awaelchli

Maybe add an extra line in the config YAML file?

SHUMKASHUN avatar Jun 18 '24 08:06 SHUMKASHUN

Hi @SHUMKASHUN Thanks for this great question. If I understand correctly, if I am using 8 nodes, the global_batch_size needs to be divided by 8 compared to the one-node setting to achieve similar performance at the same step, right?

yuzc19 avatar Jul 03 '24 07:07 yuzc19

Right, yes: if you just want to get exactly the same results on 8 nodes as on 1, then you would do that. But of course, there is no practical benefit to that, because you would use 8x more resources and not get any speedup in training compared to 1 node.
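
In numbers, with the figures from the top of the thread (my arithmetic, assuming the per-machine convention described above):

```python
devices_per_node = 8
micro_batch_size = 16
target_effective_batch = 1024      # what the 1-node run used
num_nodes = 8

# Shrink the per-machine value so the cluster-wide batch stays at 1024:
global_batch_size = target_effective_batch // num_nodes     # 128 per machine
accumulation = global_batch_size // devices_per_node // micro_batch_size
print(global_batch_size, accumulation)  # 128 1
```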

awaelchli avatar Jul 03 '24 15:07 awaelchli

Thank you! I think it actually will speed up, since the gradient accumulation iters will also be divided by 8 in this case. So for one optimization step, each node goes through fewer sequential iterations.

yuzc19 avatar Jul 03 '24 17:07 yuzc19

@yuzc19 @awaelchli In that case, did you get the same results using 8 nodes? I got performance degradation when using multi-node pretraining. (https://github.com/Lightning-AI/litgpt/issues/1836) Is there anything I should be aware of?

HaebinShin avatar Nov 25 '24 03:11 HaebinShin

Hello, I am a little confused after reading through this. If I have tuned my global batch size, micro batch size, and learning rate on a single node with 4 GPUs, I don't need to change them when switching to 4 nodes, right?
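
For example, my understanding of the arithmetic (a quick self-check assuming the per-machine convention explained above; the numbers are made up):

```python
global_batch_size = 256    # per machine, tuned on a single node
devices_per_node = 4
micro_batch_size = 8

accumulation = global_batch_size // devices_per_node // micro_batch_size
for num_nodes in (1, 4):
    print(num_nodes, accumulation, global_batch_size * num_nodes)
# The accumulation iters stay at 8 either way, but the effective batch
# per optimizer step grows from 256 to 1024 -- which I guess could
# matter for the tuned learning rate?
```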

2533245542 avatar May 30 '25 15:05 2533245542