LLaMA-Factory
Training DeepSeek-V3 hangs with version 0.9.2
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
Problem description: Training DeepSeek-V3 with version 0.9.2 of the library hangs at the position shown in the log below. Switching to a Qwen-7B model, training hangs at the same position. With the previous container environment on version 0.9.1, the same launch command and config file train normally. I currently suspect a version problem in the multi-node communication libraries in the environment, such as DeepSpeed (see the sanity-check sketch after the training log). For training DeepSeek-V3 on the updated 0.9.2 release, is there a problem with my dependency versions?
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
- `llamafactory` version: 0.9.2.dev0
- Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.17
- Python version: 3.10.16
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.48.3
- Datasets version: 3.1.0
- Accelerate version: 0.34.2
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA H800
- GPU number: 8
- GPU memory: 79.11GB
- DeepSpeed version: 0.14.4
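The sparse_attn and triton warnings above come from DeepSpeed's own compatibility checks: DeepSpeed 0.14.4 predates PyTorch 2.6 and Triton 3.2, so a version mismatch in the rebuilt container is plausible. A minimal check (a sketch, nothing LLaMA-Factory-specific) to confirm which versions the training environment actually imports:

```python
# Sanity check (sketch): print the versions the training environment
# actually imports, to compare against the env report above.
import torch
import triton
import deepspeed

print("torch    :", torch.__version__)      # env report says 2.6.0+cu124
print("triton   :", triton.__version__)     # DeepSpeed warning mentions 3.2.0
print("deepspeed:", deepspeed.__version__)  # env report says 0.14.4
```

The training log from the 0.9.2 environment follows; it stalls right after the torchrun startup messages.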
[2025-02-13 12:14:38,487] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/data/miniconda3/envs/fac/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
[INFO|2025-02-13 12:14:40] llamafactory.cli:157 >> Initializing distributed tasks at: 10.126.218.43:29500
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792]
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] *****************************************
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0213 12:14:41.474000 58459 site-packages/torch/distributed/run.py:792] *****************************************
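No further output appears after this point. To isolate whether the hang sits in the NCCL/torch.distributed layer rather than in LLaMA-Factory itself, a minimal rendezvous test (a sketch; the file name nccl_check.py and the node/GPU counts are illustrative) can be launched on every node against the same master address shown in the log:

```python
# nccl_check.py -- minimal multi-node NCCL sanity check (sketch).
# Launch on each node, e.g. for 2 nodes x 8 GPUs:
#   torchrun --nnodes=2 --nproc-per-node=8 --node-rank=<0 or 1> \
#            --master-addr=10.126.218.43 --master-port=29500 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    # Hangs here if the rendezvous itself is broken (firewall, wrong address).
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # One all_reduce across all ranks; a hang here points at the NCCL
    # transport (IB/ethernet configuration), not at the training framework.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: sum = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this test also stalls in init_process_group or all_reduce, the regression is in the container's NCCL/driver stack rather than in the 0.9.2 code path; running with NCCL_DEBUG=INFO typically shows where the rendezvous stops.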
Reproduction
No response
Others
No response