
Training DeepSeek-V3 hangs with version 0.9.2

TexasRangers86 opened this issue 2 weeks ago · 0 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

Problem description: with version 0.9.2 of the library, training DeepSeek-V3 hangs at the position shown below. Switching to a Qwen 7B model, training also hangs at the same point. Going back to the previous container environment with version 0.9.1, the same launch command and config file train normally. I currently suspect a version issue in the multi-node communication libraries in the environment, such as DeepSpeed. When training DeepSeek-V3 with the updated 0.9.2 release, is there a problem with my dependency versions?
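Before digging into library versions, it may help to rule out basic rendezvous connectivity between nodes, since the log stops right after `Initializing distributed tasks at: 10.126.218.43:29500`. A minimal stdlib-only sketch (the address and port are taken from that log line; the helper name `can_reach` is mine, not part of LLaMA-Factory):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        # create_connection resolves the host and attempts the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from every worker node against the rendezvous endpoint in the log:
#   can_reach("10.126.218.43", 29500)
# False (or a long stall) points at networking/firewall rather than the
# LLaMA-Factory upgrade itself.
```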

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
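As an aside, the `sparse_attn` warning above is expected with torch 2.6: that DeepSpeed extension declares support for torch `>= 1.5 and < 2.0` only, so the warning is benign and unrelated to the hang. A minimal sketch of that kind of range check (helper names are mine; a real project would use `packaging.version` instead of hand parsing):

```python
def version_tuple(v: str) -> tuple:
    """Parse '2.6.0+cu124' -> (2, 6, 0); local build metadata after '+' is dropped."""
    return tuple(int(p) for p in v.split("+")[0].split(".")[:3])

def in_range(v: str, lo: str, hi: str) -> bool:
    """True iff lo <= v < hi, the constraint form the sparse_attn warning uses."""
    return version_tuple(lo) <= version_tuple(v) < version_tuple(hi)

# torch 2.6.0+cu124 falls outside [1.5, 2.0), hence the warning
print(in_range("2.6.0+cu124", "1.5", "2.0"))  # → False
```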

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.17
  • Python version: 3.10.16
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.48.3
  • Datasets version: 3.1.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H800
  • GPU number: 8
  • GPU memory: 79.11GB
  • DeepSpeed version: 0.14.4

[2025-02-13 12:14:38,487] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
[INFO|2025-02-13 12:14:40] llamafactory.cli:157 >> Initializing distributed tasks at: 10.126.218.43:29500
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792]
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] *****************************************
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] *****************************************

Reproduction

Put your message here.

Others

No response

TexasRangers86 · Feb 13 '25 04:02