
[BUG] zero2 and zero3 have different behavior with the same hyperparameters when training a large model

Open superaha opened this issue 1 year ago • 24 comments

Describe the bug: The grad_norm values under ZeRO-2 and ZeRO-3 are on completely different scales, which eventually causes ZeRO-2's loss to crash.

Expected behavior: ZeRO-2 should produce the same smooth loss curve as ZeRO-3.

ds_report output


[2023-09-10 20:53:54,125] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.9/dist-packages/torch']
torch version .................... 2.1.0.dev20230424+cu117
deepspeed install path ........... ['/usr/local/lib/python3.9/dist-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.7

Additional context: Training crashes after a day or two and is hard to reproduce.

superaha avatar Sep 10 '23 12:09 superaha

Can you please share steps to repro? Also, can you confirm that the loss scaling is initialized similarly?
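For reference, fp16 loss scaling is initialized in the fp16 section of the DeepSpeed JSON config; a minimal sketch using the commonly documented defaults, which is what would need to match between the two runs:

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

A loss_scale of 0 enables dynamic loss scaling, starting from 2^initial_scale_power.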

tjruwase avatar Sep 11 '23 02:09 tjruwase

Comparing DeepSpeed stage-2 with native torch DDP (both fp32 training), I also encountered a similar problem. There must be some gap in the smoothing strategy between DS stage-2 and DDP that leads to a different scale of loss (or grad_norm).

You can see that, with the same training config, the grad_norm of DeepSpeed (stage-2) is about 5 times higher than DDP's, and this eventually causes divergence.

[image: grad_norm curves, DeepSpeed stage-2 vs. torch DDP]

Also, if we scale up to more GPUs (1 node x 8 GPUs -> 2 nodes x 16 GPUs), we immediately see a huge mismatch and a loss crash:

[image: loss curves after scaling from 8 to 16 GPUs, showing the mismatch and crash]

my ds_report:

[2023-11-02 17:07:48,189] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/site-packages/torch']
torch version .................... 1.13.0+cu116
deepspeed install path ........... ['/usr/local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
shared memory (/dev/shm) size .... 125.81 GB

xingchensong avatar Nov 02 '23 02:11 xingchensong

Following @superaha's observations, I'm happy to run a double-check experiment on stage-1 & stage-3 to confirm that this issue only occurs with stage-2. If I have time, I could provide a minimal repro recipe, or even a PR to fix the issue. cc @tjruwase

xingchensong avatar Nov 02 '23 02:11 xingchensong

Hello, I've recently been using ZeRO-2 to train a LLaMA-2 13B model on 64 GPUs, and I see a similar issue. The loss goes down for the first few dozen steps, then diverges. Is this a bug introduced in a recent version? This is a pretty serious problem and should have been noticed long ago.

kisseternity avatar Nov 02 '23 08:11 kisseternity

Besides, the loss converges normally when I train with Megatron, with all hyper-parameters the same except that I set tensor parallelism to 2 in Megatron, so the DP size is different. However, that shouldn't make the loss diverge, in my opinion. I've also tried a smaller lr, which doesn't help either. At first I suspected something was wrong in the bf16 optimizer implementation, but in this issue fp32 also triggers the problem, as mentioned by @xingchensong.

kisseternity avatar Nov 02 '23 09:11 kisseternity

@kisseternity could u try stage-3 to double-check this issue?

xingchensong avatar Nov 02 '23 09:11 xingchensong

Since I'm using Ethernet between nodes, ZeRO-3 is not really an option... Once the Megatron training is done, I may try it.

kisseternity avatar Nov 02 '23 09:11 kisseternity

Adding the DeepSpeed training TensorBoard curves (unfortunately without grad_norm), compared with Megatron.

DeepSpeed: [image: DeepSpeed loss curve]

Megatron (with multiple runs): [image: Megatron loss curves]

Could you please take a look at this issue again? Thanks. @tjruwase

kisseternity avatar Nov 03 '23 02:11 kisseternity

Double-checked: there is indeed a pretty serious problem in stage-2.

stage-1 & stage-3 look good and almost equal in loss & grad_norm. cc @tjruwase @loadams

[image: loss & grad_norm comparison across ZeRO stages]

xingchensong avatar Nov 05 '23 14:11 xingchensong

I have also encountered the same problem and look forward to a solution.

zzhanghub avatar Nov 10 '23 04:11 zzhanghub

Maybe this is a recently introduced bug; I will try downgrading the DeepSpeed version this weekend to see if that is the case.

kisseternity avatar Nov 10 '23 06:11 kisseternity

@kisseternity, @xingchensong, @zzhanghub thanks for the triaging effort on this issue. Could you please try v0.10.0 as well? Thanks!

tjruwase avatar Nov 10 '23 10:11 tjruwase

I've tried v0.10.0 and the issue still occurred.

kisseternity avatar Nov 13 '23 08:11 kisseternity

@kisseternity, thanks for the update. We will investigate further.

tjruwase avatar Nov 13 '23 15:11 tjruwase

Update: We have investigated further. The norm difference is caused by a difference in the gradients. Our current workaround is to set overlap_comm to False when using ZeRO-2. This fixes the issue but makes training slower. @tjruwase, hope this helps.
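For reference, a minimal sketch of the relevant part of the DeepSpeed JSON config with this workaround applied (all other fields omitted):

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": false
  }
}
```

With overlap_comm disabled, gradient reduction no longer overlaps with the backward pass, which is where the slowdown mentioned above comes from.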

superaha avatar Nov 14 '23 02:11 superaha

@kisseternity could u try stage-3 to double-check this issue?

I've tried ZeRO-3 and it's fine. Besides, the training speed is fine compared to ZeRO-2 with communication overlap, better than expected. It seems the prefetch in ZeRO-3 can hide some of the 1.5x communication cost, making the speed pretty fast even between Ethernet-connected nodes.

kisseternity avatar Nov 14 '23 02:11 kisseternity

Does 'prefetch' here refer to a parameter of the torch dataloader or a parameter of deepspeed?

xingchensong avatar Nov 14 '23 03:11 xingchensong

Here I mean prefetching the next layers' parameters for the all_gather; see stage3_prefetch_bucket_size for reference. The dataloader's prefetch also overlaps data processing time, but with ZeRO-3 the main communication cost in LLM training lies in the all_gather and reduce ops.
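A rough sketch of where this is tuned in the ZeRO-3 section of the config (the bucket size below is just an illustrative value, not a recommendation):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_prefetch_bucket_size": 5e8
  }
}
```

A larger prefetch bucket gathers more upcoming parameters ahead of time, giving more room to hide communication behind compute at the cost of extra memory.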

kisseternity avatar Nov 14 '23 03:11 kisseternity

great, thx

xingchensong avatar Nov 14 '23 03:11 xingchensong

@superaha, thanks for this valuable tip.

tjruwase avatar Nov 14 '23 14:11 tjruwase

Hi team, any update?

xingchensong avatar Nov 28 '23 03:11 xingchensong

Has there been any recent progress? 🤔️

zzhanghub avatar Dec 05 '23 02:12 zzhanghub

I am convinced that this issue is of significant concern. Have you identified any potential solutions?

patrick-tssn avatar Mar 18 '24 03:03 patrick-tssn

This issue still exists in 0.14.0. Zero2 has a much larger grad norm than Zero3.

mutonix avatar May 20 '24 14:05 mutonix