
Error when running on Ascend NPU

Open · feria-tu opened this issue 1 year ago · 0 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

After launching the web UI with `ASCEND_RT_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui`, attempting to start training fails with the following error:

```
05/17/2024 01:30:26 - WARNING - llmtuner.model.utils.checkpointing - You are using the old GC format, some features (e.g. BAdam) will be invalid.
05/17/2024 01:30:26 - INFO - llmtuner.model.utils.checkpointing - Gradient checkpointing enabled.
05/17/2024 01:30:26 - INFO - llmtuner.model.utils.attention - Using vanilla attention implementation.
05/17/2024 01:30:26 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
[E OpParamMaker.cpp:273] call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-05-17-01:30:26.960.959 Op Cast does not has any binary.
        TraceBack (most recent call last):
        Kernel Run failed. opType: 53, Cast launch failed for Cast, errno:561000.

[ERROR] 2024-05-17-01:30:26 (PID:41961, Device:0, RankID:-1) ERR01005 OPS internal error
Exception raised from operator() at third_party/op-plugin/op_plugin/ops/base_ops/opapi/CastKernelNpuOpApi.cpp:33 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x68 (0xffffa7858538 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xffffa78058a0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: + 0x8ddac0 (0xfffdc1d3fac0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0xe2696c (0xfffdc228896c in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x56b9f0 (0xfffdc19cd9f0 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x56be18 (0xfffdc19cde18 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: + 0x569e20 (0xfffdc19cbe20 in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0xafe0c (0xffffa788ae0c in /root/miniconda3/envs/llama/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: + 0x7088 (0xffffb1c12088 in /lib/aarch64-linux-gnu/libpthread.so.0)

Traceback (most recent call last):
  File "/root/miniconda3/envs/llama/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/cli.py", line 49, in main
    run_exp()
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 34, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/model/loader.py", line 137, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
  File "/home/HwHiAiUser/LLaMA-Factory/src/llmtuner/model/adapter.py", line 196, in init_adapter
    param.data = param.data.to(torch.float32)
RuntimeError: The Inner error is reported as above. Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-05-17-01:30:26 (PID:41961, Device:0, RankID:-1) ERR00100 PTA call acl api failed
```
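
The RuntimeError above notes that the NPU op is launched asynchronously, so the stacktrace may not point at the real failing call. A minimal sketch of rerunning with synchronous launches, as the message itself suggests (the relaunch command is the same one as in the reproduction and is shown commented out rather than executed):

```shell
# Make NPU kernel launches synchronous so the Python stacktrace is accurate.
export ASCEND_LAUNCH_BLOCKING=1
echo "ASCEND_LAUNCH_BLOCKING=$ASCEND_LAUNCH_BLOCKING"
# Then relaunch exactly as in the reproduction:
#   ASCEND_RT_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui
```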

Expected behavior

No response

System Info

No response

Others

No response

feria-tu · May 17 '24 01:05