DeepSpeed [BUG] grad_norm and loss is nan when deepspeed==0.13.5 but ok with deepspeed==0.10.2

Describe the bug: when fine-tuning my model with deepspeed==0.13.5 and the Hugging Face Trainer, loss and grad_norm become nan at step 2.

However, either of the two workarounds below solves the problem:

Downgrade to deepspeed==0.10.2.
Add this to my DeepSpeed config (which slows down training):

"comms_logger": {
  "enabled": true,
  "verbose": false,
  "prof_all": true,
  "debug": false
}

Why does this happen? Is there a bug I don't know about, or are there any clues on how to solve this?

To Reproduce

Steps to reproduce the behavior:

My run script:

deepspeed \
    pretraining.py \
    --model_type auto \
    --model_name_or_path /app/nfs_share_dir/3/llm_model/Baichuan2-7B-Base \
    --train_file_dir /app/nfs_share_dir/1/archive/v2/token-baichuan/tmp \
    --validation_file_dir /app/nfs_share_dir/1/archive/v2/token-baichuan/tmp \
    --lazy_mode True \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --seed 3 \
    --warmup_ratio 0.01 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --lr_scheduler_type cosine \
    --weight_decay 1e-4 \
    --logging_strategy steps \
    --logging_steps 1 \
    --save_steps 1000 \
    --save_strategy steps \
    --save_total_limit 10 \
    --gradient_accumulation_steps 1 \
    --block_size 4096 \
    --torch_compile True \
    --output_dir outputs_qwen \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --log_on_each_node 0 \
    --torch_dtype bfloat16 \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --deepspeed ./config/ds_2_config.json \
    --bf16 \
    --bf16_full_eval

deepspeed config

{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto",
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "flops_profiler": {
        "enabled": true,
        "profile_step": 10,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true
    },
    "tensorboard": {
        "enabled": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Expected behavior: loss should be neither 0 nor nan.

ds_report output

[2024-03-08 17:53:06,490] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.1.2
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 503.72 GB

System info (please complete the following information):

OS: Linux version 3.10.0-1127.19.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623
1 machine with 8x A800 GPUs
Python version: 3.10.13
transformers==4.38.0

Mar 08 '24 10:03 Chandler-Bing

@Chandler-Bing - are you able to test with any versions in between 0.10.2 and 0.13.5? Could you test with 0.11.x or 0.12.x?

Mar 11 '24 22:03 loadams

@loadams Sure, I tested different versions of DeepSpeed (Y = trains normally, N = fails):

deepspeed==0.11.2 Y
deepspeed==0.12.1 Y
deepspeed==0.12.2 Y
deepspeed==0.12.3 Y
deepspeed==0.12.4 N (grad_norm is always 1.0 and loss is 0)
deepspeed==0.12.5 N (grad_norm is always 1.0 and loss is 0)
deepspeed==0.12.6 N (grad_norm is always 1.0 and loss is 0)
deepspeed==0.13.5 N (grad_norm is always nan and loss is 0)

It seems the regression was introduced in 0.12.4.

Mar 12 '24 02:03 Chandler-Bing

@Chandler-Bing - thanks, the changelog between 0.12.3 and 0.12.4 is fairly small: https://github.com/microsoft/DeepSpeed/compare/v0.12.3...v0.12.4

I'll have to take a closer look, but if you're able to easily git bisect/binary search those commits that would help, using something like pip install git+https://github.com/microsoft/deepspeed.git@
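
The binary search suggested above could be organized roughly as follows. This is only a sketch: `first_bad_commit` and the `is_good` callback are hypothetical names, and it assumes the commits are ordered oldest to newest, that the last one is known to be bad, and that every commit after the first bad one is also bad. In practice `is_good` would install the given commit and run a few training steps.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_good() is False.

    Assumes commits are ordered oldest to newest and that the last
    commit is known to be bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid + 1  # the regression was introduced after mid
        else:
            hi = mid  # mid is bad, so the first bad commit is mid or earlier
    return commits[lo]


# Placeholder commit names; a real run would use hashes from the compare view.
commits = ["c0", "c1", "c2", "c3", "c4", "c5"]
known_bad = {"c3", "c4", "c5"}
print(first_bad_commit(commits, lambda c: c not in known_bad))
```

With n commits this needs about log2(n) training runs instead of n.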

Mar 12 '24 17:03 loadams

@loadams I also encountered the same problem. More experiments:

deepspeed==0.12.4, zero-2, multi-node. N (grad_norm is always 1.0 and loss is 0)
deepspeed==0.12.4, zero-2, one-node. Y
deepspeed==0.12.4, zero-3, multi-node. Y
deepspeed==0.12.4, zero-3, one-node. Y

When use_multi_rank_bucket_allreduce is set to false, the loss and grad_norm are normal, but training is significantly slower than with deepspeed==0.12.3. So I guess the commit causing the problem is https://github.com/microsoft/DeepSpeed/pull/4695.
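
For anyone who wants to try the same workaround, here is a minimal sketch of turning the flag off in a config dict. It assumes the key belongs under `zero_optimization`, alongside the other ZeRO options; the helper name `disable_multi_rank_bucket_allreduce` is made up for illustration.

```python
import copy
import json


def disable_multi_rank_bucket_allreduce(ds_config: dict) -> dict:
    """Return a copy of ds_config with use_multi_rank_bucket_allreduce set to False."""
    patched = copy.deepcopy(ds_config)  # leave the caller's dict untouched
    # Assumption: the flag is read from the zero_optimization section.
    patched.setdefault("zero_optimization", {})["use_multi_rank_bucket_allreduce"] = False
    return patched


base = {"zero_optimization": {"stage": 2, "overlap_comm": True}}
patched = disable_multi_rank_bucket_allreduce(base)
print(json.dumps(patched, indent=4))
```

As noted above, this trades speed for correctness, so it is a diagnostic step rather than a fix.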

Mar 13 '24 04:03 renhouxing

same here

Mar 14 '24 10:03 dumpmemory

I'm using deepspeed 0.13.1 with torch 2.2.1 and CUDA 12.2 on one node with 8x A100 (40G). I train the LM with bf16, and grad_norm is always nan.

Mar 14 '24 23:03 tic-top

@loadams Sorry for the late reply. I think @renhouxing is right. I tried every [commit from 0.12.3 to 0.12.4](https://github.com/microsoft/DeepSpeed/compare/v0.12.3...v0.12.4) using pip install git+https://github.com/microsoft/deepspeed.git@

{'loss': 2.6719, 'grad_norm': 1.0, 'learning_rate': 1.25e-06, 'timestamp': 1710756122.1381884, 'global_step': 1, 'epoch': 0.0}                                                                                                         
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 2.5e-06, 'timestamp': 1710756132.5874143, 'global_step': 2, 'epoch': 0.0}                                                                                                             
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 3.7500000000000005e-06, 'timestamp': 1710756142.9050674, 'global_step': 3, 'epoch': 0.0}                                                                                              
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 5e-06, 'timestamp': 1710756153.2131193, 'global_step': 4, 'epoch': 0.0}                                                                                                               
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 6.25e-06, 'timestamp': 1710756163.5214288, 'global_step': 5, 'epoch': 0.0}                                                                                                            
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 7.500000000000001e-06, 'timestamp': 1710756173.8260207, 'global_step': 6, 'epoch': 0.0}                                                                                               
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 8.750000000000001e-06, 'timestamp': 1710756184.1361465, 'global_step': 7, 'epoch': 0.0}                                                                                               
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 1e-05, 'timestamp': 1710756194.4409266, 'global_step': 8, 'epoch': 0.01}

pip list | grep deep
deepspeed                     0.12.4+2afa1c7f

Since commit 2afa1c7f, the loss is always 0.

It seems to be a big commit. I will go through the code line by line to identify the issue and provide more useful information.
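
When testing many commits, the collapse shown in the log above (loss dropping to exactly 0.0 from step 2) can be detected automatically rather than by eye. A rough sketch, assuming the log entries are available as Python dicts with the same keys the Trainer prints; `first_degenerate_step` is a hypothetical helper, not part of transformers or DeepSpeed.

```python
import math


def first_degenerate_step(log_entries):
    """Return the global_step of the first entry whose loss is NaN or exactly
    0.0, or whose grad_norm is NaN. Return None if every entry looks healthy."""
    for entry in log_entries:
        loss = entry.get("loss")
        grad_norm = entry.get("grad_norm")
        loss_bad = loss is not None and (math.isnan(loss) or loss == 0.0)
        norm_bad = grad_norm is not None and math.isnan(grad_norm)
        if loss_bad or norm_bad:
            return entry.get("global_step")
    return None


# First two entries of the log above, trimmed to the relevant keys.
logs = [
    {"loss": 2.6719, "grad_norm": 1.0, "global_step": 1},
    {"loss": 0.0, "grad_norm": 1.0, "global_step": 2},
]
print(first_degenerate_step(logs))
```

Note that a grad_norm of exactly 1.0 is not flagged here, since on its own it can also occur in a healthy run.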

Mar 18 '24 10:03 Chandler-Bing

Hello, I have the same issue. When training on the A100, everything operates normally during MLLM stage1. However, during stage2, ds 0.9.5 functions correctly, but version 0.14.0 does not. With AMD MI250X, ds 0.14.0 fails to work properly, whereas ds 0.12.3 does. Additionally, when using ds 0.14.0, I noticed that multi-node setup does not accelerate my training (the same time as that of one node). I'm unsure if this issue is related.

Mar 29 '24 15:03 xxtars

Hi @xxtars - can you create a new issue with the errors you are seeing on AMD?

Apr 01 '24 15:04 loadams

@loadams I have created a new issue, #5347, describing my problem. Any advice on stabilizing and speeding up training would be greatly appreciated.

Apr 02 '24 06:04 xxtars

Same issue

May 11 '24 07:05 shashwat14

Setting overlap_comm to False can avoid this problem.
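
A minimal sketch of applying this to an existing JSON config file, such as the ds_2_config.json posted at the top of the thread where `overlap_comm` sits under `zero_optimization`. The helper name `set_overlap_comm` is made up for illustration.

```python
import json
from pathlib import Path


def set_overlap_comm(config_path: str, enabled: bool) -> dict:
    """Rewrite the DeepSpeed JSON config at config_path with
    zero_optimization.overlap_comm set to `enabled`, and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config.setdefault("zero_optimization", {})["overlap_comm"] = enabled
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Example call (path taken from the run script in the original report):
# set_overlap_comm("./config/ds_2_config.json", False)
```

The file is overwritten in place, so keep a copy of the original if you want to switch back later.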

May 11 '24 07:05 efsotr

Thanks efsotr. Resolved!

May 12 '24 07:05 shashwat14

I take that back. Issue still persists.

May 12 '24 08:05 shashwat14

I am facing the same problem with the latest DeepSpeed version: the loss is constant (~11) and grad_norm is constantly 0. I was able to run the job on 1 GPU, but I won't be lucky every time. If anyone has solved this, it would be a great help.

May 13 '24 14:05 abadjatya

I encounter the same problem with ds version==0.14. Is there any solution? Must I change the version?

May 28 '24 13:05 TuuSiwei

Yeah, I have the same problem with v0.14, but it works with v0.12.3, so there may be a bug in the latest version.

When I trained the model in a single-node multi-GPU environment, it seemed to work. But multi-node training does not behave as expected for me.

Jun 15 '24 06:06 zipzou

Same here. Using 0.14 on a single node with 4 H100s and qwen2-7b, grad_norm is always NaN. But it works on a single node with 6 H100s.

Jul 10 '24 10:07 darcula1993

@darcula1993 I am curious what your DeepSpeed configuration looks like.

Jul 10 '24 10:07 efsotr

As @efsotr suggested, setting overlap_comm to False solved the issue for me. But what can we do to keep using overlap_comm? Training slows down significantly without it.

Jul 11 '24 18:07 orrzohar

It seems that https://github.com/microsoft/DeepSpeed/pull/5606 solves this problem.

Jul 12 '24 01:07 efsotr

So, is this bug solved or not?

Jul 12 '24 01:07 zipzou

Same problem with ds version==0.14: on a single node with 8 A100s and qwen2-7b, grad_norm is always NaN. But it works with ds version==0.10.2.

Jul 12 '24 02:07 HuangJoJo

@zipzou - are you able to test with the latest version from source to confirm whether the bug is solved?

Jul 12 '24 04:07 loadams

Hi, has this bug been fixed in any of the recent versions?

Oct 12 '24 02:10 kuozhang
