
Fine-tuning DeepSeek-R1-Distill-Qwen-32B with DeepSpeed ZeRO-3 offload: the system hangs and is unresponsive for a long time

Open erichuazhou opened this issue 10 months ago • 8 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
  • Python version: 3.10.4
  • PyTorch version: 2.5.1+cu121 (GPU)
  • Transformers version: 4.48.3
  • Datasets version: 3.2.0
  • Accelerate version: 1.2.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 3090
  • GPU number: 8
  • GPU memory: 23.69GB
  • DeepSpeed version: 0.16.3
  • Bitsandbytes version: 0.45.2
  • vLLM version: 0.6.5

Reproduction

The fine-tuning config file deepseek_distill_qwen_32B_lora_sft.yaml is as follows:

### model
model_name_or_path: DeepSeek-R1-Distill-Qwen-32B
#trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: ds_z3_offload_config.json

# use FlashAttention
flash_attn: fa2

### dataset
dataset: TCM_SFT
template: deepseek3
cutoff_len: 300
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: deepseek_SFT/deepseek_distill_qwen_32B
logging_steps: 1
save_steps: 200
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10
tokenized_path: /sft_tokenized_path/deepseek_distill_qwen_32B_cutoff_300

### train
per_device_train_batch_size: 1
#gradient_accumulation_steps: 8
gradient_accumulation_steps: 2
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.01
weight_decay: 0.05
bf16: true
ddp_timeout: 180000000

The ds_z3_offload_config.json file is the one from https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/deepspeed/ds_z3_offload_config.json.
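For context, the part of that config that matters here is the ZeRO-3 offload section, which pushes both the partitioned parameters and the optimizer states into host (CPU) memory. A minimal sketch of those fields as a Python dict (my own illustration of the relevant DeepSpeed keys, not the repo file verbatim):

# sketch of the ZeRO-3 offload fields that matter here (not the repo file verbatim);
# "cpu" offload is what moves parameter and optimizer memory into host RAM
ds_config_sketch = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}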

Training on a single machine with 8x RTX 3090 cards. After reading some of the model's configuration files, the GPU server hangs in a nearly frozen state and only responds after a very long time. The training command is as follows:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FORCE_TORCHRUN=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
llamafactory-cli train deepseek_distill_qwen_32B_lora_sft.yaml
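
To check whether the hang is host-RAM exhaustion during zero.init() and CPU offload rather than a GPU problem, a small monitor can be run in another shell while the job starts; this is just a sketch that assumes psutil is installed (the one-second interval and output format are arbitrary). Watching nvidia-smi and, after a crash, checking dmesg for oom-killer messages gives the same information.

# monitor_ram.py -- sketch: print host RAM and swap usage once per second
# (assumes psutil is installed; run alongside the training job in another shell)
import time
import psutil

while True:
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used {vm.used / 2**30:.1f}/{vm.total / 2**30:.1f} GB, "
          f"swap used {swap.used / 2**30:.1f} GB", flush=True)
    time.sleep(1)

If host RAM climbs toward the limit while the weights are being loaded and offloaded, the later exitcode -9 (SIGKILL) points to the kernel OOM killer rather than an NCCL problem.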

The log is as follows:

[2025-02-16 23:56:50,210] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,212] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|modeling_utils.py:3901] 2025-02-16 23:56:50,217 >> loading weights file /mnt/4tdisk/huazhou/original_model/DeepSeek-R1-Distill-Qwen-32B/model.safetensors.index.json
[INFO|modeling_utils.py:4078] 2025-02-16 23:56:50,218 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-02-16 23:56:50,219] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:328] 2025-02-16 23:56:50,224 >> You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
[WARNING|logging.py:328] 2025-02-16 23:56:50,224 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-02-16 23:56:50,240] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[INFO|configuration_utils.py:1140] 2025-02-16 23:56:50,240 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "use_cache": false
}

[2025-02-16 23:56:50,240] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[rank6]:[E217 00:13:05.609415848 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 6] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank3]:[E217 00:14:54.880931348 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank3]:[E217 00:14:54.238189864 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank0]:[E217 00:22:12.753574225 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank0]:[E217 00:22:12.082219285 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank1]:[E217 00:22:49.585706993 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank1]:[E217 00:22:50.897909513 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank3]:[F217 00:22:54.246585945 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 3] [PG ID 0 PG GUID 0(default_pg) Rank 3] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank6]:[E217 00:26:23.883287055 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank0]:[F217 00:30:23.465171480 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 0] [PG ID 0 PG GUID 0(default_pg) Rank 0] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank1]:[F217 00:32:51.737126758 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 1] [PG ID 0 PG GUID 0(default_pg) Rank 1] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank6]:[F217 00:37:36.558389836 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 6] [PG ID 0 PG GUID 0(default_pg) Rank 6] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank5]:[E217 04:33:09.329933871 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 5] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank2]:[E217 04:33:09.338193755 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank4]:[E217 04:33:09.344455978 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 4] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank4]:[E217 04:33:11.155788790 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank5]:[E217 04:33:10.470933723 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank2]:[E217 04:33:13.305947571 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
W0217 04:33:15.292410 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7427 closing signal SIGTERM
W0217 04:33:16.276860 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7429 closing signal SIGTERM
W0217 04:33:16.277613 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7431 closing signal SIGTERM
W0217 04:33:16.278180 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7432 closing signal SIGTERM
W0217 04:33:16.278744 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7433 closing signal SIGTERM
W0217 04:33:16.279249 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7434 closing signal SIGTERM
E0217 04:33:24.095435 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 3 (pid: 7430) of binary: /mnt/4tdisk/huazhou/envs/llamafactorypy310/bin/python
Traceback (most recent call last):
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/huazhou/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-17_04:33:15
  host      : bys
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 7430)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 7430
============================================================

Others

No response

@hiyouga Could you please take a look? Many thanks!

erichuazhou avatar Feb 17 '25 07:02 erichuazhou

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

jienimi avatar Feb 21 '25 01:02 jienimi

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

1. template: deepseek3, so the template is deepseek3. 2. With 8x24GB and ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Any advice would be appreciated. @jienimi

erichuazhou avatar Feb 21 '25 02:02 erichuazhou

Root Cause (first observed failure): [0]: time : 2025-02-17_04:33:15 host : bys rank : 3 (local_rank: 3) exitcode : -9 (pid: 7430) error_file: <N/A> traceback : Signal 9 (SIGKILL) received by PID 7430

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen. 2. Doing the math, 8x24GB with ZeRO-3 offload should be enough GPU memory. But as for exitcode: -9 (pid: 7430), when I looked into it before it was caused by running out of system RAM; I hit the same error and it ran fine after I added more RAM.

jienimi avatar Feb 21 '25 09:02 jienimi

A question: for a CoT model like this, does the fine-tuning data also need to contain CoT (reasoning traces)?

Jimmy-L99 avatar Feb 21 '25 18:02 Jimmy-L99

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

1. template: deepseek3, so the template is deepseek3. 2. With 8x24GB and ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Any advice would be appreciated. @jienimi

The backbone is qwen, so of course the qwen template is used; it has nothing to do with deepseek.

cehao628 avatar Feb 23 '25 15:02 cehao628

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

1. template: deepseek3, so the template is deepseek3. 2. With 8x24GB and ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Any advice would be appreciated. @jienimi

cutoff_len: 300 doesn't make much sense either... it truncates the reasoning process.

cehao628 avatar Feb 23 '25 15:02 cehao628

Hi, after fine-tuning, did you run into the following error at inference time?
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) TypeError: not a string

yumingfan-0219 avatar Mar 04 '25 10:03 yumingfan-0219

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

1. template: deepseek3, so the template is deepseek3. 2. With 8x24GB and ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Any advice would be appreciated. @jienimi

The backbone is qwen, so of course the qwen template is used; it has nothing to do with deepseek.

I double-checked and it is the deepseek3 template: although the backbone is qwen, they used their own template for SFT.

Linzwcs avatar Mar 07 '25 11:03 Linzwcs

Not enough system RAM; the process was killed by the OS (kill -9). Just add more RAM.

zsai001 avatar Mar 25 '25 11:03 zsai001
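
A rough back-of-the-envelope check supports the host-RAM explanation: with offload_param, the frozen 32B-parameter base model is kept in CPU memory in bf16, which alone is on the order of 60 GB, before pinned buffers, the preprocessing workers, and everything else on the machine. A minimal sketch of that estimate (my own numbers, assuming bf16 offloaded parameters):

# rough host-RAM estimate for ZeRO-3 CPU offload of a 32B LoRA run (sketch, not exact)
params = 32e9                     # base model parameters (frozen under LoRA)
bytes_per_param_bf16 = 2          # offloaded parameters are stored in bf16
param_ram_gb = params * bytes_per_param_bf16 / 2**30
print(f"offloaded base params alone: ~{param_ram_gb:.0f} GB of host RAM")  # ~60 GB
# LoRA optimizer states are small (only the adapters are trainable), but pinned
# buffers, CUDA/NCCL overhead, and dataloader workers add more on top of this.

So on a machine with 64 GB of RAM or less, the OOM kill is expected, and adding RAM (as suggested above) is the straightforward fix.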