DeepSpeed
[BUG] grad_norm and loss is nan when deepspeed==0.13.5 but ok with deepspeed==0.10.2
Describe the bug
When fine-tuning my model with deepspeed==0.13.5 and the Hugging Face Trainer, loss and grad_norm become nan at step 2.
Either of the two changes below avoids the problem:
- deepspeed==0.10.2
- adding this to my deepspeed config (which slows down my training):
"comms_logger": {
"enabled": true,
"verbose": false,
"prof_all": true,
"debug": false
}
Why does this happen? Is there a bug I don't know about? Are there any clues for solving this?
To Reproduce
Steps to reproduce the behavior:
- my run script
deepspeed \
pretraining.py \
--model_type auto \
--model_name_or_path /app/nfs_share_dir/3/llm_model/Baichuan2-7B-Base \
--train_file_dir /app/nfs_share_dir/1/archive/v2/token-baichuan/tmp \
--validation_file_dir /app/nfs_share_dir/1/archive/v2/token-baichuan/tmp \
--lazy_mode True \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--do_train \
--do_eval \
--seed 3 \
--warmup_ratio 0.01 \
--num_train_epochs 1 \
--learning_rate 2e-5 \
--lr_scheduler_type cosine \
--weight_decay 1e-4 \
--logging_strategy steps \
--logging_steps 1 \
--save_steps 1000 \
--save_strategy steps \
--save_total_limit 10 \
--gradient_accumulation_steps 1 \
--block_size 4096 \
--torch_compile True \
--output_dir outputs_qwen \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--log_on_each_node 0 \
--torch_dtype bfloat16 \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True \
--deepspeed ./config/ds_2_config.json \
--bf16 \
--bf16_full_eval
- deepspeed config
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"bf16": {
"enabled": true
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"reduce_scatter": true,
"reduce_bucket_size": "auto",
"overlap_comm": true,
"contiguous_gradients": true
},
"flops_profiler": {
"enabled": true,
"profile_step": 10,
"module_depth": -1,
"top_modules": 1,
"detailed": true
},
"tensorboard": {
"enabled": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 1000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Expected behavior
loss should be neither 0 nor nan
ds_report output
[2024-03-08 17:53:06,490] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.1.2
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 503.72 GB
Screenshots
System info (please complete the following information):
- OS: Linux version 3.10.0-1127.19.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623
- 1 machines with x8 A800s each
- Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
- Python version: 3.10.13
- transformers==4.38.0
@Chandler-Bing - are you able to test with any versions in between 0.10.2 and 0.13.5? Could you test with 0.11.x or 0.12.x?
@loadams Sure, I tested different versions of deepspeed (Y = trains normally, N = fails):
- deepspeed==0.11.2 Y
- deepspeed==0.12.1 Y
- deepspeed==0.12.2 Y
- deepspeed==0.12.3 Y
- deepspeed==0.12.4 N (grad_norm is always 1.0 and loss is 0)
- deepspeed==0.12.5 N (grad_norm is always 1.0 and loss is 0)
- deepspeed==0.12.6 N (grad_norm is always 1.0 and loss is 0)
- deepspeed==0.13.5 N (grad_norm is always nan and loss is 0)
It seems the regression was introduced in 0.12.4.
@Chandler-Bing - thanks, the changelog between 0.12.3 and 0.12.4 is fairly small: https://github.com/microsoft/DeepSpeed/compare/v0.12.3...v0.12.4
I'll have to take a closer look, but if you're able to easily git bisect/binary search those commits that would help, using something like pip install git+https://github.com/microsoft/deepspeed.git@
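The binary search suggested above could be sketched roughly as follows. This is only an illustration: `first_bad_commit`, `install_command`, and the `is_bad` callback are hypothetical names, not part of DeepSpeed or git. It assumes the commit list is ordered oldest to newest, that the last commit reproduces the problem, and that every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def install_command(commit: str) -> str:
    """Build the pip command quoted in the thread for one commit hash."""
    return f"pip install git+https://github.com/microsoft/deepspeed.git@{commit}"


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first commit where is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, the last one is bad,
    and badness is monotonic (once bad, always bad afterwards).
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # first bad commit is at mid or earlier
        else:
            lo = mid + 1    # first bad commit is strictly after mid
    return commits[lo]
```

In practice `is_bad` would install the commit with `install_command`, run a few training steps, and report whether loss or grad_norm went wrong; that part is left to the reader because it depends on the training setup.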
@loadams I also encountered the same problem. More experiments:
- deepspeed==0.12.4, zero-2, multi-node: N (grad_norm is always 1.0 and loss is 0)
- deepspeed==0.12.4, zero-2, one-node: Y
- deepspeed==0.12.4, zero-3, multi-node: Y
- deepspeed==0.12.4, zero-3, one-node: Y
When use_multi_rank_bucket_allreduce is set to false, loss and grad_norm are normal, but training is significantly slower than with deepspeed==0.12.3. So I guess the commit causing the problem is https://github.com/microsoft/DeepSpeed/pull/4695.
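A minimal sketch of trying this workaround on a config held as a Python dict. The comment above does not say where the flag goes, so placing it under "zero_optimization" is an assumption here, and `with_zero_option` is a hypothetical helper, not a DeepSpeed API.

```python
import copy
import json


def with_zero_option(config: dict, key: str, value) -> dict:
    """Return a copy of a DeepSpeed config dict with one ZeRO option set."""
    patched = copy.deepcopy(config)  # leave the caller's dict untouched
    patched.setdefault("zero_optimization", {})[key] = value
    return patched


# Trimmed-down version of the ZeRO-2 section from the config in this issue.
base = {"zero_optimization": {"stage": 2, "overlap_comm": True}}

# Assumption: the flag belongs in the "zero_optimization" section.
patched = with_zero_option(base, "use_multi_rank_bucket_allreduce", False)
print(json.dumps(patched, indent=2))
```

The patched dict can be dumped to a JSON file and passed via --deepspeed as in the run script; as noted above, expect slower training with this setting.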
Same here.
I'm using deepspeed 0.13.1 with torch 2.2.1 and cuda 12.2 on one node with 8 * A100 (40G).
I train this LM with bf16, and grad_norm is always nan.
@loadams Sorry for the late reply. I think @renhouxing is right.
I installed every [commit from 0.12.3 to 0.12.4](https://github.com/microsoft/DeepSpeed/compare/v0.12.3...v0.12.4) with pip install git+https://github.com/microsoft/deepspeed.git@ followed by the commit hash:
{'loss': 2.6719, 'grad_norm': 1.0, 'learning_rate': 1.25e-06, 'timestamp': 1710756122.1381884, 'global_step': 1, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 2.5e-06, 'timestamp': 1710756132.5874143, 'global_step': 2, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 3.7500000000000005e-06, 'timestamp': 1710756142.9050674, 'global_step': 3, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 5e-06, 'timestamp': 1710756153.2131193, 'global_step': 4, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 6.25e-06, 'timestamp': 1710756163.5214288, 'global_step': 5, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 7.500000000000001e-06, 'timestamp': 1710756173.8260207, 'global_step': 6, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 8.750000000000001e-06, 'timestamp': 1710756184.1361465, 'global_step': 7, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 1e-05, 'timestamp': 1710756194.4409266, 'global_step': 8, 'epoch': 0.01}
pip list | grep deep
deepspeed 0.12.4+2afa1c7f
Since commit 2afa1c7f, loss is always 0.
It seems like a big commit. I will go through the code line by line to identify the issue and provide more useful information.
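When testing many commits, the check on the logged metrics can be automated. The sketch below is an assumption about how one might do it, not something from the thread: `is_suspicious` and `first_suspicious_step` are hypothetical helpers that flag a record when loss or grad_norm is nan, or when loss is exactly 0, using the record format shown in the log lines above.

```python
import math
from typing import Iterable, Optional


def _is_nan(value) -> bool:
    """True if value is a number that is NaN; missing values are not NaN."""
    return value is not None and math.isnan(float(value))


def is_suspicious(record: dict) -> bool:
    """Flag a log record whose loss or grad_norm is nan, or whose loss is 0."""
    loss = record.get("loss")
    grad_norm = record.get("grad_norm")
    return _is_nan(loss) or _is_nan(grad_norm) or loss == 0.0


def first_suspicious_step(records: Iterable[dict]) -> Optional[int]:
    """Return the global_step of the first flagged record, or None."""
    for record in records:
        if is_suspicious(record):
            return record.get("global_step")
    return None


# First two records from the log above (other fields omitted).
logs = [
    {"loss": 2.6719, "grad_norm": 1.0, "global_step": 1},
    {"loss": 0.0, "grad_norm": 1.0, "global_step": 2},
]
print(first_suspicious_step(logs))
```

A constant grad_norm of 1.0 is deliberately not flagged here, since the first (healthy-looking) step also reports it.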
Hello, I have the same issue. When training on the A100, everything operates normally during MLLM stage1. However, during stage2, ds 0.9.5 functions correctly, but version 0.14.0 does not. With AMD MI250X, ds 0.14.0 fails to work properly, whereas ds 0.12.3 does. Additionally, when using ds 0.14.0, I noticed that multi-node setup does not accelerate my training (the same time as that of one node). I'm unsure if this issue is related.
Hi @xxtars - can you create a new issue with the errors you are seeing on AMD?
@loadams I have created a new issue to describe my problem #5347. Any advice on stabilizing and speeding up training would be greatly appreciated.
Same issue
Setting overlap_comm to False can avoid this problem.
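For anyone who wants to try this without editing their original file, here is a small sketch that writes a variant of an existing config with overlap_comm turned off. `write_no_overlap_config` and the output file name in the usage note are hypothetical; only the "overlap_comm" key under "zero_optimization" comes from the config posted in this issue.

```python
import json
from pathlib import Path


def write_no_overlap_config(src: str, dst: str) -> dict:
    """Copy a DeepSpeed JSON config to dst with overlap_comm set to false."""
    config = json.loads(Path(src).read_text())
    # overlap_comm lives in the zero_optimization section of the posted config.
    config.setdefault("zero_optimization", {})["overlap_comm"] = False
    Path(dst).write_text(json.dumps(config, indent=2))
    return config
```

For example, `write_no_overlap_config("./config/ds_2_config.json", "./config/ds_2_config_no_overlap.json")` would produce a second file to pass via --deepspeed. Reports in this thread are mixed, so treat this as an experiment rather than a confirmed fix.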
Thanks efsotr. Resolved!
I take that back. Issue still persists.
I am facing the same problem with the latest deepspeed version: loss is constant (~11) and grad_norm is constantly 0. I was able to run the job on 1 GPU, but I won't be lucky every time. If anyone has solved this, it would be a great help.
I encountered the same problem with ds version==0.14. Is there any solution? Must I change the version?
Yeah, I have the same problem with v0.14, but it works with v0.12.3, so there may be a bug in the latest version.
When I trained the model in a single-node multi-GPU environment, it seemed to work. But multi-node training does not behave as expected for me.
Same here. Using 0.14, single node, 4 H100, qwen2-7b: grad_norm is always NaN. But it works on a single node with 6 H100.
@darcula1993 I am curious what your deepspeed configuration looks like.
Setting overlap_comm to False can avoid this problem.
This solved the issue for me, but what can we do to keep using overlap_comm? Without it, training slows down significantly.
It seems that https://github.com/microsoft/DeepSpeed/pull/5606 solves this problem.
So, is this bug solved or not?
Same problem with ds version==0.14: single node, 8 A100, qwen2-7b, grad_norm is always NaN. But it works with ds version==0.10.2.
So, is this bug solved or not?
Are you able to test with the latest version from source to confirm?
Hi, is there any fix for this bug in the last few versions?