
Fine-tuning DeepSeek-R1-Distill-Qwen-32B with DeepSpeed ZeRO-3 offload: the system hangs and is unresponsive for a long time

Open erichuazhou opened this issue 10 months ago • 8 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
  • Python version: 3.10.4
  • PyTorch version: 2.5.1+cu121 (GPU)
  • Transformers version: 4.48.3
  • Datasets version: 3.2.0
  • Accelerate version: 1.2.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 3090
  • GPU number: 8
  • GPU memory: 23.69GB
  • DeepSpeed version: 0.16.3
  • Bitsandbytes version: 0.45.2
  • vLLM version: 0.6.5

Reproduction

The fine-tuning config file deepseek_distill_qwen_32B_lora_sft.yaml is as follows:

### model
model_name_or_path: DeepSeek-R1-Distill-Qwen-32B
#trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: ds_z3_offload_config.json

# use FlashAttention
flash_attn: fa2

### dataset
dataset: TCM_SFT
template: deepseek3
cutoff_len: 300
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: deepseek_SFT/deepseek_distill_qwen_32B
logging_steps: 1
save_steps: 200
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10
tokenized_path: /sft_tokenized_path/deepseek_distill_qwen_32B_cutoff_300

### train
per_device_train_batch_size: 1
#gradient_accumulation_steps: 8
gradient_accumulation_steps: 2
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.01
weight_decay: 0.05
bf16: true
ddp_timeout: 180000000

The ds_z3_offload_config.json file is the one from https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/deepspeed/ds_z3_offload_config.json.
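For context, the part of that config that matters here is the ZeRO-3 offload section, which pushes both the partitioned parameters and the optimizer states into host (CPU) memory. A minimal sketch of those fields as a Python dict (my own illustration of the relevant DeepSpeed keys, not the repo file verbatim):

# sketch of the ZeRO-3 offload fields that matter here (not the repo file verbatim);
# "cpu" offload is what moves parameter and optimizer memory into host RAM
ds_config_sketch = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}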

Training on a single machine with 8x RTX 3090 cards. After reading some of the model's configuration files, the GPU server hangs in a nearly frozen state and only responds after a very long time. The training command is as follows:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FORCE_TORCHRUN=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
llamafactory-cli train deepseek_distill_qwen_32B_lora_sft.yaml
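
To check whether the hang is host-RAM exhaustion during zero.init() and CPU offload rather than a GPU problem, a small monitor can be run in another shell while the job starts; this is just a sketch that assumes psutil is installed (the one-second interval and output format are arbitrary). Watching nvidia-smi and, after a crash, checking dmesg for oom-killer messages gives the same information.

# monitor_ram.py -- sketch: print host RAM and swap usage once per second
# (assumes psutil is installed; run alongside the training job in another shell)
import time
import psutil

while True:
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used {vm.used / 2**30:.1f}/{vm.total / 2**30:.1f} GB, "
          f"swap used {swap.used / 2**30:.1f} GB", flush=True)
    time.sleep(1)

If host RAM climbs toward the limit while the weights are being loaded and offloaded, the later exitcode -9 (SIGKILL) points to the kernel OOM killer rather than an NCCL problem.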

The log is as follows:

[2025-02-16 23:56:50,210] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,212] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[INFO|modeling_utils.py:3901] 2025-02-16 23:56:50,217 >> loading weights file /mnt/4tdisk/huazhou/original_model/DeepSeek-R1-Distill-Qwen-32B/model.safetensors.index.json
[INFO|modeling_utils.py:4078] 2025-02-16 23:56:50,218 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2025-02-16 23:56:50,219] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[WARNING|logging.py:328] 2025-02-16 23:56:50,224 >> You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
[WARNING|logging.py:328] 2025-02-16 23:56:50,224 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[2025-02-16 23:56:50,227] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-02-16 23:56:50,240] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
[INFO|configuration_utils.py:1140] 2025-02-16 23:56:50,240 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "use_cache": false
}

[2025-02-16 23:56:50,240] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[rank6]:[E217 00:13:05.609415848 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 6] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank3]:[E217 00:14:54.880931348 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank3]:[E217 00:14:54.238189864 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank0]:[E217 00:22:12.753574225 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank0]:[E217 00:22:12.082219285 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank1]:[E217 00:22:49.585706993 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank1]:[E217 00:22:50.897909513 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank3]:[F217 00:22:54.246585945 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 3] [PG ID 0 PG GUID 0(default_pg) Rank 3] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank6]:[E217 00:26:23.883287055 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank0]:[F217 00:30:23.465171480 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 0] [PG ID 0 PG GUID 0(default_pg) Rank 0] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank1]:[F217 00:32:51.737126758 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 1] [PG ID 0 PG GUID 0(default_pg) Rank 1] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank6]:[F217 00:37:36.558389836 ProcessGroupNCCL.cpp:1306] [PG ID 0 PG GUID 0(default_pg) Rank 6] [PG ID 0 PG GUID 0(default_pg) Rank 6] Terminating the process after attempting to dump debug info, due to ProcessGroupNCCL watchdog hang.
[rank5]:[E217 04:33:09.329933871 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 5] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank2]:[E217 04:33:09.338193755 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank4]:[E217 04:33:09.344455978 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 4] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. 
[rank4]:[E217 04:33:11.155788790 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank5]:[E217 04:33:10.470933723 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank2]:[E217 04:33:13.305947571 ProcessGroupNCCL.cpp:1515] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
W0217 04:33:15.292410 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7427 closing signal SIGTERM
W0217 04:33:16.276860 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7429 closing signal SIGTERM
W0217 04:33:16.277613 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7431 closing signal SIGTERM
W0217 04:33:16.278180 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7432 closing signal SIGTERM
W0217 04:33:16.278744 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7433 closing signal SIGTERM
W0217 04:33:16.279249 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7434 closing signal SIGTERM
E0217 04:33:24.095435 7362 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 3 (pid: 7430) of binary: /mnt/4tdisk/huazhou/envs/llamafactorypy310/bin/python
Traceback (most recent call last):
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/4tdisk/huazhou/envs/llamafactorypy310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/huazhou/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-17_04:33:15
  host      : bys
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 7430)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 7430
============================================================

Others

No response

@hiyouga Could you please take a look? Many thanks!

erichuazhou avatar Feb 17 '25 07:02 erichuazhou

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

jienimi avatar Feb 21 '25 01:02 jienimi

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

1. template: deepseek3, so the template is deepseek3. 2. With 8x24GB and ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Any advice would be appreciated. @jienimi

erichuazhou avatar Feb 21 '25 02:02 erichuazhou

Root Cause (first observed failure): [0]: time : 2025-02-17_04:33:15 host : bys rank : 3 (local_rank: 3) exitcode : -9 (pid: 7430) error_file: <N/A> traceback : Signal 9 (SIGKILL) received by PID 7430

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen. 2. Doing the math, 8x24GB with ZeRO-3 offload should be enough GPU memory. But as for exitcode: -9 (pid: 7430), when I looked into it before it was caused by running out of system RAM; I hit the same error and it ran fine after I added more RAM.

jienimi avatar Feb 21 '25 09:02 jienimi

A question: for a CoT model like this, does the fine-tuning data also need to contain CoT (reasoning traces)?

Jimmy-L99 avatar Feb 21 '25 18:02 Jimmy-L99

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

1. template: deepseek3, so the template is deepseek3. 2. With 8x24GB and ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Any advice would be appreciated. @jienimi

The backbone is qwen, so of course the qwen template is used; it has nothing to do with deepseek.

cehao628 avatar Feb 23 '25 15:02 cehao628

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

1. template: deepseek3, so the template is deepseek3. 2. With 8x24GB and ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Any advice would be appreciated. @jienimi

cutoff_len: 300 doesn't make much sense either... it truncates the reasoning process.

cehao628 avatar Feb 23 '25 15:02 cehao628

Hi, after fine-tuning, did you run into the following error at inference time?
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) TypeError: not a string

yumingfan-0219 avatar Mar 04 '25 10:03 yumingfan-0219

1. The template for DeepSeek-R1-Distill-Qwen-32B should be qwen, right? 2. 8x RTX 3090 may not have enough GPU memory.

1. template: deepseek3, so the template is deepseek3. 2. With 8x24GB and ZeRO-3 offload, is that still not enough? I'm not sure how much GPU memory is needed. Any advice would be appreciated. @jienimi

The backbone is qwen, so of course the qwen template is used; it has nothing to do with deepseek.

I double-checked and it is the deepseek3 template: although the backbone is qwen, they used their own template for SFT.

Linzwcs avatar Mar 07 '25 11:03 Linzwcs

Not enough system RAM; the process was killed by the OS (kill -9). Just add more RAM.

zsai001 avatar Mar 25 '25 11:03 zsai001
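
A rough back-of-the-envelope check supports the host-RAM explanation: with offload_param, the frozen 32B-parameter base model is kept in CPU memory in bf16, which alone is on the order of 60 GB, before pinned buffers, the preprocessing workers, and everything else on the machine. A minimal sketch of that estimate (my own numbers, assuming bf16 offloaded parameters):

# rough host-RAM estimate for ZeRO-3 CPU offload of a 32B LoRA run (sketch, not exact)
params = 32e9                     # base model parameters (frozen under LoRA)
bytes_per_param_bf16 = 2          # offloaded parameters are stored in bf16
param_ram_gb = params * bytes_per_param_bf16 / 2**30
print(f"offloaded base params alone: ~{param_ram_gb:.0f} GB of host RAM")  # ~60 GB
# LoRA optimizer states are small (only the adapters are trainable), but pinned
# buffers, CUDA/NCCL overhead, and dataloader workers add more on top of this.

So on a machine with 64 GB of RAM or less, the OOM kill is expected, and adding RAM (as suggested above) is the straightforward fix.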