Fine-tuning DeepSeek-R1-Distill-Qwen-32B with DeepSpeed ZeRO-3 offload: the system hangs and stays unresponsive for a long time
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
- `llamafactory` version: 0.9.2.dev0
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.4
- PyTorch version: 2.5.1+cu121 (GPU)
- Transformers version: 4.48.3
- Datasets version: 3.2.0
- Accelerate version: 1.2.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 3090
- GPU number: 8
- GPU memory: 23.69GB
- DeepSpeed version: 0.16.3
- Bitsandbytes version: 0.45.2
- vLLM version: 0.6.5
Reproduction
The fine-tuning config file `deepseek_distill_qwen_32B_lora_sft.yaml` is as follows:
### model
model_name_or_path: DeepSeek-R1-Distill-Qwen-32B
#trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: ds_z3_offload_config.json
# use FlashAttention
flash_attn: fa2
### dataset
dataset: TCM_SFT
template: deepseek3
cutoff_len: 300
overwrite_cache: true
preprocessing_num_workers: 8
### output
output_dir: deepseek_SFT/deepseek_distill_qwen_32B
logging_steps: 1
save_steps: 200
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10
tokenized_path: /sft_tokenized_path/deepseek_distill_qwen_32B_cutoff_300
### train
per_device_train_batch_size: 1
#gradient_accumulation_steps: 8
gradient_accumulation_steps: 2
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.01
weight_decay: 0.05
bf16: true
ddp_timeout: 180000000
The ds_z3_offload_config.json file is https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/deepspeed/ds_z3_offload_config.json.
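For context, the part of that file relevant to this issue is the ZeRO-3 section, which offloads both the partitioned parameters and the optimizer state to pinned host (CPU) memory. The sketch below writes an approximate copy to a separate example file so the upstream config is not overwritten; the field values are from memory, so treat the linked JSON as authoritative.

# Rough sketch of the ZeRO-3 offload settings in question (approximate; check the repo file).
cat > ds_z3_offload_config.example.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF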
Training on a single machine with 8x RTX 3090. After reading some of the model's configuration files, the GPU server hangs in what looks like a frozen state and only becomes responsive again after a very long time. The training command is as follows:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FORCE_TORCHRUN=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
llamafactory-cli train deepseek_distill_qwen_32B_lora_sft.yaml
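If the hang turns out to be a watchdog false positive rather than a real deadlock, the error messages in the log below name two environment variables that control the heartbeat monitor; combined with NCCL's own debug output, they can help locate where the ranks stall. A sketch of a debug re-run (same command, extra knobs):

# Optional debugging knobs; the two TORCH_NCCL_* variables are the ones suggested in the watchdog message below.
export NCCL_DEBUG=INFO                        # verbose NCCL logs to see where the ranks stall
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800  # give the watchdog more headroom
# export TORCH_NCCL_ENABLE_MONITORING=0       # or disable the heartbeat monitor entirely
llamafactory-cli train deepseek_distill_qwen_32B_lora_sft.yaml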
The log output is as follows:
[2025-02-16 23:56:50,210] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,212] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|modeling_utils.py:3901] 2025-02-16 23:56:50,217 >> loading weights file /mnt/4tdisk/huazhou/original_model/DeepSeek-R1-Distill-Qwen-32B/model.safetensors.index.json
[INFO|modeling_utils.py:4078] 2025-02-16 23:56:50,218 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-02-16 23:56:50,219] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:328] 2025-02-16 23:56:50,224 >> You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
[WARNING|logging.py:328] 2025-02-16 23:56:50,224 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-02-16 23:56:50,240] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[INFO|configuration_utils.py:1140] 2025-02-16 23:56:50,240 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151643,
"use_cache": false
}
[2025-02-16 23:56:50,240] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[rank6]:[E217 00:13:05.609415848 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 6] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank3]:[E217 00:14:54.880931348 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank3]:[E217 00:14:54.238189864 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank0]:[E217 00:22:12.753574225 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank0]:[E217 00:22:12.082219285 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank1]:[E217 00:22:49.585706993 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank1]:[E217 00:22:50.897909513 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank3]:[F217 00:22:54.246585945 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 3] [PG ID 0 PG GUID 0(default_pg) Rank 3] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank6]:[E217 00:26:23.883287055 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank0]:[F217 00:30:23.465171480 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 0] [PG ID 0 PG GUID 0(default_pg) Rank 0] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank1]:[F217 00:32:51.737126758 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 1] [PG ID 0 PG GUID 0(default_pg) Rank 1] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank6]:[F217 00:37:36.558389836 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 6] [PG ID 0 PG GUID 0(default_pg) Rank 6] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank5]:[E217 04:33:09.329933871 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 5] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank2]:[E217 04:33:09.338193755 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank4]:[E217 04:33:09.344455978 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 4] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
[rank4]:[E217 04:33:11.155788790 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank5]:[E217 04:33:10.470933723 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank2]:[E217 04:33:13.305947571 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
W0217 04:33:15.292410 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7427 closing signal SIGTERM
W0217 04:33:16.276860 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7429 closing signal SIGTERM
W0217 04:33:16.277613 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7431 closing signal SIGTERM
W0217 04:33:16.278180 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7432 closing signal SIGTERM
W0217 04:33:16.278744 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7433 closing signal SIGTERM
W0217 04:33:16.279249 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7434 closing signal SIGTERM
E0217 04:33:24.095435 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 3 (pid: 7430) of binary: /mnt/4tdisk/huazhou/envs/llamafactorypy310/bin/python
Traceback (most recent call last):
File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/huazhou/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-17_04:33:15
host : bys
rank : 3 (local_rank: 3)
exitcode : -9 (pid: 7430)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 7430
============================================================
Others
No response
@hiyouga Could you please take a look? Thanks a lot!
1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x 3090 may not have enough VRAM.
1. `template: deepseek3` — the template is deepseek3. 2. 8*24G with ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Please advise. @jienimi
1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen. 2. For "8*24G with ZeRO-3 offload", doing the math, the VRAM should be enough. But `exitcode: -9 (pid: 7430)`: when I looked this up before, it was caused by insufficient host memory (RAM). I hit the same error, and after adding more memory it ran normally.
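To confirm whether the kernel OOM killer was responsible on this machine, the usual checks are the kernel log and host memory usage while the model is loading, roughly:

# Look for OOM-killer activity and watch host RAM during model loading.
sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
free -h             # snapshot of host memory
watch -n 2 free -h  # watch usage climb while the weights are offloaded to CPU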
A question: for a CoT model like this, does the fine-tuning data also need to include
The backbone is Qwen, so of course the qwen template should be used; it has nothing to do with deepseek.
`cutoff_len: 300` doesn't make much sense either... it truncates the reasoning process.
Hi, after fine-tuning, did you run into the following error at inference time?
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
I double-checked: it is the deepseek3 template. Although the backbone is Qwen, they used their own template for SFT.
Not enough host memory; the process was killed by the system (kill -9). Adding more RAM will fix it.
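A back-of-envelope check supports this: with `offload_param` and `offload_optimizer` set to `cpu`, the partitioned bf16 weights of a roughly 32B-parameter model alone occupy about 60 GiB of host RAM, before counting pinned communication buffers, the optimizer state for the LoRA parameters, dataset workers, and the overhead of 8 ranks, so if the host has on the order of 64 GB of RAM it is very likely to be OOM-killed during or shortly after model loading. A quick sanity calculation (approximate):

# ~60 GiB of host RAM just for the bf16 base weights offloaded to CPU (32e9 params * 2 bytes).
python3 -c "print(f'{32e9 * 2 / 2**30:.1f} GiB for bf16 weights on CPU')"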