
Training DeepSeek-V3 hangs with version 0.9.2

TexasRangers86 opened this issue 2 weeks ago · 0 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

Problem description: with version 0.9.2 of the library, training DeepSeek-V3 hangs at the position shown below. Switching to a Qwen 7B model, training also hangs at the same point. Going back to the previous container environment with version 0.9.1, the same launch command and config file train normally. I currently suspect a version issue in the multi-node communication libraries in the environment, such as DeepSpeed. When training DeepSeek-V3 with the updated 0.9.2 release, is there a problem with my dependency versions?
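Before digging into library versions, it may help to rule out basic rendezvous connectivity between nodes, since the log stops right after `Initializing distributed tasks at: 10.126.218.43:29500`. A minimal stdlib-only sketch (the address and port are taken from that log line; the helper name `can_reach` is mine, not part of LLaMA-Factory):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        # create_connection resolves the host and attempts the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from every worker node against the rendezvous endpoint in the log:
#   can_reach("10.126.218.43", 29500)
# False (or a long stall) points at networking/firewall rather than the
# LLaMA-Factory upgrade itself.
```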

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
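As an aside, the `sparse_attn` warning above is expected with torch 2.6: that DeepSpeed extension declares support for torch `>= 1.5 and < 2.0` only, so the warning is benign and unrelated to the hang. A minimal sketch of that kind of range check (helper names are mine; a real project would use `packaging.version` instead of hand parsing):

```python
def version_tuple(v: str) -> tuple:
    """Parse '2.6.0+cu124' -> (2, 6, 0); local build metadata after '+' is dropped."""
    return tuple(int(p) for p in v.split("+")[0].split(".")[:3])

def in_range(v: str, lo: str, hi: str) -> bool:
    """True iff lo <= v < hi, the constraint form the sparse_attn warning uses."""
    return version_tuple(lo) <= version_tuple(v) < version_tuple(hi)

# torch 2.6.0+cu124 falls outside [1.5, 2.0), hence the warning
print(in_range("2.6.0+cu124", "1.5", "2.0"))  # → False
```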

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.17
  • Python version: 3.10.16
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.48.3
  • Datasets version: 3.1.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H800
  • GPU number: 8
  • GPU memory: 79.11GB
  • DeepSpeed version: 0.14.4

[2025-02-13 12:14:38,487] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
[INFO|2025-02-13 12:14:40] llamafactory.cli:157 >> Initializing distributed tasks at: 10.126.218.43:29500
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792]
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] *****************************************
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] *****************************************

Reproduction

Put your message here.

Others

No response

TexasRangers86 · Feb 13 '25 04:02