
[BUG] zero2 and zero3 have different behavior with the same hyperparameters when training a large model

Open superaha opened this issue 1 year ago • 24 comments

Describe the bug: The grad_norm values under ZeRO-2 and ZeRO-3 are on completely different scales, which eventually causes ZeRO-2's loss to crash.

Expected behavior: ZeRO-2 should produce the same smooth loss curve as ZeRO-3.

ds_report output


[2023-09-10 20:53:54,125] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.9/dist-packages/torch']
torch version .................... 2.1.0.dev20230424+cu117
deepspeed install path ........... ['/usr/local/lib/python3.9/dist-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.7

Additional context: Training crashes after a day or two and is hard to reproduce.

superaha avatar Sep 10 '23 12:09 superaha

Can you please share steps to repro? Also, can you confirm that the loss scaling is initialized similarly?
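For reference, fp16 loss scaling is initialized in the fp16 section of the DeepSpeed JSON config; a minimal sketch using the commonly documented defaults, which is what would need to match between the two runs:

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

A loss_scale of 0 enables dynamic loss scaling, starting from 2^initial_scale_power.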

tjruwase avatar Sep 11 '23 02:09 tjruwase

Comparing DeepSpeed stage-2 with native torch DDP (both fp32 training), I also encountered a similar problem. There must be some gap in the smoothing strategy between DS stage-2 and DDP that leads to a different scale of loss (or grad_norm).

You can see that, with the same training config, the grad_norm of DeepSpeed (stage-2) is about 5 times higher than DDP's, and this eventually causes divergence.

[image: grad_norm curves, DeepSpeed stage-2 vs. torch DDP]

Also, if we scale up to more GPUs (1 node x 8 GPUs -> 2 nodes x 16 GPUs), we immediately see a huge mismatch and a loss crash:

[image: loss curves after scaling from 8 to 16 GPUs, showing the mismatch and crash]

my ds_report:

[2023-11-02 17:07:48,189] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/site-packages/torch']
torch version .................... 1.13.0+cu116
deepspeed install path ........... ['/usr/local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.11.1, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
shared memory (/dev/shm) size .... 125.81 GB

xingchensong avatar Nov 02 '23 02:11 xingchensong

Following @superaha's observations, I'm happy to run a double-check experiment on stage-1 & stage-3 to confirm that this issue only occurs with stage-2. If I have time, I could provide a minimal repro recipe, or even a PR to fix the issue. cc @tjruwase

xingchensong avatar Nov 02 '23 02:11 xingchensong

Hello, I've recently been using ZeRO-2 to train a LLaMA-2 13B model on 64 GPUs, and I see a similar issue. The loss goes down for the first few dozen steps, then diverges. Is this a bug introduced in a recent version? This is a pretty serious problem and should have been noticed long ago.

kisseternity avatar Nov 02 '23 08:11 kisseternity

Besides, the loss converges normally when I train with Megatron, with all hyper-parameters the same except that I set tensor parallelism to 2 in Megatron, so the DP size is different. However, that shouldn't make the loss diverge, in my opinion. I've also tried a smaller lr, which doesn't help either. At first I suspected something was wrong in the bf16 optimizer implementation, but in this issue fp32 also triggers the problem, as mentioned by @xingchensong.

kisseternity avatar Nov 02 '23 09:11 kisseternity

@kisseternity could u try stage-3 to double-check this issue?

xingchensong avatar Nov 02 '23 09:11 xingchensong

Since I'm using Ethernet between nodes, ZeRO-3 is not really an option... Once the Megatron training is done, I may try it.

kisseternity avatar Nov 02 '23 09:11 kisseternity

Adding the DeepSpeed training TensorBoard curves (unfortunately without grad_norm), compared with Megatron.

DeepSpeed: [image: DeepSpeed loss curve]

Megatron (with multiple runs): [image: Megatron loss curves]

Could you please take a look at this issue again? Thanks. @tjruwase

kisseternity avatar Nov 03 '23 02:11 kisseternity

Double-checked: there is indeed a pretty serious problem in stage-2.

stage-1 & stage-3 look good and almost equal in loss & grad_norm. cc @tjruwase @loadams

[image: loss & grad_norm comparison across ZeRO stages]

xingchensong avatar Nov 05 '23 14:11 xingchensong

I have also encountered the same problem and look forward to a solution.

zzhanghub avatar Nov 10 '23 04:11 zzhanghub

Maybe this is a recently introduced bug; I will try downgrading the DeepSpeed version this weekend to see if that is the case.

kisseternity avatar Nov 10 '23 06:11 kisseternity

@kisseternity, @xingchensong, @zzhanghub thanks for the triaging effort on this issue. Could you please try v0.10.0 as well? Thanks!

tjruwase avatar Nov 10 '23 10:11 tjruwase

I've tried v0.10.0 and the issue still occurred.

kisseternity avatar Nov 13 '23 08:11 kisseternity

@kisseternity, thanks for the update. We will investigate further.

tjruwase avatar Nov 13 '23 15:11 tjruwase

Update: We have investigated further. The norm difference is caused by a difference in the gradients. Our current workaround is to set overlap_comm to False when using ZeRO-2. This fixes the issue but makes training slower. @tjruwase, hope this helps.
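For reference, a minimal sketch of the relevant part of the DeepSpeed JSON config with this workaround applied (all other fields omitted):

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": false
  }
}
```

With overlap_comm disabled, gradient reduction no longer overlaps with the backward pass, which is where the slowdown mentioned above comes from.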

superaha avatar Nov 14 '23 02:11 superaha

@kisseternity could u try stage-3 to double-check this issue?

I've tried ZeRO-3 and it's fine. Besides, the training speed is fine compared to ZeRO-2 with communication overlap, better than expected. It seems the prefetch in ZeRO-3 can hide some of the 1.5x communication cost, making the speed pretty fast even between Ethernet-connected nodes.

kisseternity avatar Nov 14 '23 02:11 kisseternity

Does 'prefetch' here refer to a parameter of the torch dataloader or a parameter of deepspeed?

xingchensong avatar Nov 14 '23 03:11 xingchensong

Here I mean prefetching the next layers' parameters for the all_gather; see stage3_prefetch_bucket_size for reference. The dataloader's prefetch also overlaps data processing time, but with ZeRO-3 the main communication cost in LLM training lies in the all_gather and reduce ops.
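A rough sketch of where this is tuned in the ZeRO-3 section of the config (the bucket size below is just an illustrative value, not a recommendation):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_prefetch_bucket_size": 5e8
  }
}
```

A larger prefetch bucket gathers more upcoming parameters ahead of time, giving more room to hide communication behind compute at the cost of extra memory.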

kisseternity avatar Nov 14 '23 03:11 kisseternity

great, thx

xingchensong avatar Nov 14 '23 03:11 xingchensong

@superaha, thanks for this valuable tip.

tjruwase avatar Nov 14 '23 14:11 tjruwase

Hi team, any update?

xingchensong avatar Nov 28 '23 03:11 xingchensong

Has there been any recent progress? 🤔️

zzhanghub avatar Dec 05 '23 02:12 zzhanghub

I am convinced that this issue is of significant concern. Have you identified any potential solutions?

patrick-tssn avatar Mar 18 '24 03:03 patrick-tssn

This issue still exists in 0.14.0. Zero2 has a much larger grad norm than Zero3.

mutonix avatar May 20 '24 14:05 mutonix