### Reminder
- [X] I have read the README and searched the existing issues.
### Reproduction
After launching the web UI with `ASCEND_RT_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui`, the following error is raised when attempting to start training:
```
05/17/2024 01:30:26 - WARNING - llmtuner.model.utils.checkpointing - You are using the old GC format, some features (e.g. BAdam) will be invalid.
05/17/2024 01:30:26 - INFO - llmtuner.model.utils.checkpointing - Gradient checkpointing enabled.
05/17/2024 01:30:26 - INFO - llmtuner.model.utils.attention - Using vanilla attention implementation.
05/17/2024 01:30:26 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
[E OpParamMaker.cpp:273] call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-05-17-01:30:26.960.959 Op Cast does not has any binary.
TraceBack (most recent call last):
Kernel Run failed. opType: 53, Cast
launch failed for Cast, errno:561000.
[ERROR] 2024-05-17-01:30:26 (PID:41961, Device:0, RankID:-1) ERR01005 OPS internal error
Exception raised from operator() at third_party/op-plugin/op_plugin/ops/base_ops/opapi/CastKernelNpuOpApi.cpp:33 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x68 (0xffffa7858538 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xffffa78058a0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: + 0x8ddac0 (0xfffdc1d3fac0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0xe2696c (0xfffdc228896c in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x56b9f0 (0xfffdc19cd9f0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x56be18 (0xfffdc19cde18 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: + 0x569e20 (0xfffdc19cbe20 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0xafe0c (0xffffa788ae0c in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: + 0x7088 (0xffffb1c12088 in /lib/aarch64-linux-gnu/libpthread.so.0)

Traceback (most recent call last):
  File "/root/miniconda3/envs/llama/bin/llamafactory-cli", line 8, in
    sys.exit(main())
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/cli.py", line 49, in main
    run_exp()
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 34, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/model/loader.py", line 137, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/model/adapter.py", line 196, in init_adapter
    param.data = param.data.to(torch.float32)
RuntimeError: The Inner error is reported as above.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-05-17-01:30:26 (PID:41961, Device:0, RankID:-1) ERR00100 PTA call acl api failed
```
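As the log itself suggests, the NPU op is launched asynchronously, so the reported stacktrace may not point at the real failing call. A debugging sketch (same flags as the original command, with `ASCEND_LAUNCH_BLOCKING=1` added per the error message) to reproduce with an accurate stacktrace:

```shell
# Force synchronous kernel launch on Ascend so the Python stacktrace
# points at the actual failing operator (here, the aclnnCast call).
export ASCEND_LAUNCH_BLOCKING=1

# Same invocation as in the reproduction above.
ASCEND_RT_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui
```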
### Expected behavior
No response
### System Info
No response
### Others
No response