
Error when training qwen2.5-vl on NPU

Open ChingKwanCheung opened this issue 10 months ago • 5 comments

The parameters are as follows:

swift sft \
--model XX/qwen25-vl-3B \
--model_type qwen2_5_vl \
--num_train_epochs 1 \
--freeze_llm False \
--freeze_vit False \
--freeze_aligner False \
--dataset $train_list \
--max_pixels 1330000 \
--max_length 4096 \
--eval_steps 1000 \
--eval_strategy no \
--save_steps 300 \
--save_total_limit 2 \
--train_type full \
--per_device_train_batch_size 1 \
--learning_rate 1e-5 \
--output_dir output \
--deepspeed zero2 \
--ddp_backend hccl \
--truncation_strategy right \
--torch_dtype float16 \
--lora_dtype float16

The error is as follows:

Traceback (most recent call last):
  File "/home/ma-user/work/ms-swift-main/swift/cli/sft.py", line 16, in <module>
    sft_main()
  File "/home/ma-user/work/ms-swift-main/swift/llm/train/sft.py", line 263, in sft_main
    return SwiftSft(args).main()
  File "/home/ma-user/work/ms-swift-main/swift/llm/base.py", line 46, in main
    result = self.run()
  File "/home/ma-user/work/ms-swift-main/swift/llm/train/sft.py", line 143, in run
    return self.train(trainer)
  File "/home/ma-user/work/ms-swift-main/swift/llm/train/sft.py", line 202, in train
    trainer.train(trainer.args.resume_from_checkpoint)
  File "/home/ma-user/work/ms-swift-main/swift/trainers/mixin.py", line 266, in train
    res = super().train(*args, **kwargs)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/transformers/trainer.py", line 2185, in train
    return inner_training_loop(
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/transformers/trainer.py", line 2491, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/transformers/trainer.py", line 3652, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 261, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2053, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2058, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ma-user/env/swift3newversion/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: InnerRun:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:200 OPS function error: Conv3DBackpropFilter, error code is 500002

ChingKwanCheung • Mar 06 '25 13:03

The environment is as follows:

absl-py 2.1.0
accelerate 1.3.0
addict 2.4.0
aiofiles 23.2.1
aiohappyeyeballs 2.4.4
aiohttp 3.11.11
aiosignal 1.3.2
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
annotated-types 0.7.0
anyio 4.8.0
arrow 1.3.0
ascendebug 0.1.0
asttokens 2.4.1
astunparse 1.6.3
async-timeout 5.0.1
attrdict 2.0.1
attrs 23.2.0
auto_tune 0.1.0
av 14.0.1
binaryornot 0.4.4
binpacking 1.5.2
certifi 2024.8.30
cffi 1.17.1
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
configparser 6.0.0
contourpy 1.3.1
cookiecutter 2.6.0
cpm-kernels 1.0.11
crcmod 1.7
cryptography 3.4.7
cycler 0.12.1
dacite 1.8.1
dataflow 0.0.1
datasets 3.2.0
debugpy 1.8.5
decorator 5.1.1
deepspeed 0.16.2
dill 0.3.8
distro 1.9.0
docstring_parser 0.16
einops 0.8.0
entrypoints 0.4
esdk-obs-python 3.23.12
exceptiongroup 1.2.2
executing 2.1.0
fastapi 0.115.6
ffmpy 0.5.0
filelock 3.16.1
flatbuffers 24.12.23
fonttools 4.54.1
frozenlist 1.5.0
fsspec 2024.9.0
future 1.0.0
gast 0.6.0
google-pasta 0.2.0
gradio 5.12.0
gradio_client 1.5.4
grpcio 1.69.0
h11 0.14.0
h5py 3.12.1
hccl 0.1.0
hccl_parser 0.1
hjson 3.1.0
httpcore 1.0.7
httpx 0.28.1
huaweicloudsdkcore 3.1.94
huggingface-hub 0.27.1
idna 3.8
importlib_metadata 8.5.0
ipykernel 6.7.0
ipython 8.27.0
jedi 0.19.1
jieba 0.42.1
Jinja2 3.1.4
jiter 0.8.2
jmespath 0.10.0
joblib 1.4.2
jupyter_client 7.4.9
jupyter_core 5.7.2
keras 3.8.0
kiwisolver 1.4.8
lazy-import 0.2.2
libclang 18.1.1
llm_datadist 0.0.1
lxml 5.3.0
ma-cli 1.2.3
Markdown 3.7
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
ml-dtypes 0.4.1
mock 5.1.0
modelarts 1.4.28
modelscope 1.22.2
moxing-framework 2.2.8.0aa484aa
mpmath 1.3.0
ms_swift 3.2.0.dev0 /home/ma-user/work/z00854892/ms-swift-main
msgpack 1.1.0
msobjdump 0.1.0
multidict 6.1.0
multiprocess 0.70.16
namex 0.0.8
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.3
nltk 3.9.1
npu_bridge 1.15.0
npu_device 0.1
numpy 1.26.0
op_compile_tool 0.1.0
op_gen 0.1
op_test_frame 0.1
opc_tool 0.1.0
openai 1.59.8
opt_einsum 3.4.0
optree 0.14.0
orjson 3.10.14
oss2 2.19.1
packaging 24.1
pandas 2.2.3
parso 0.8.4
pathlib2 2.3.7.post1
peft 0.14.0
pexpect 4.9.0
pillow 10.4.0
pip 22.3.1
platformdirs 4.3.2
prettytable 3.7.0
prompt_toolkit 3.0.47
propcache 0.2.1
protobuf 3.20.3
psutil 6.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyarrow 19.0.0
pyasn1 0.5.1
pycparser 2.22
pycryptodome 3.21.0
pydantic 2.10.5
pydantic_core 2.27.2
pydub 0.25.1
Pygments 2.18.0
pyparsing 3.2.0
python-dateutil 2.9.0.post0
python-multipart 0.0.20
python-slugify 8.0.4
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
qwen-vl-utils 0.0.10
regex 2024.11.6
requests 2.32.3
requests-toolbelt 1.0.0
rich 13.9.2
rouge 1.0.1
ruff 0.9.2
safehttpx 0.1.6
safetensors 0.5.2
schedule_search 0.0.1
scipy 1.14.1
semantic-version 2.10.0
sentencepiece 0.2.0
setuptools 69.5.1
shellingham 1.5.4
shtab 1.7.1
simplejson 3.19.3
six 1.16.0
sniffio 1.3.1
sortedcontainers 2.4.0
stack-data 0.6.3
starlette 0.41.3
sympy 1.13.1
tabulate 0.9.0
te 0.4.0
tenacity 8.2.2
tensorboard 2.18.0
tensorboard-data-server 0.7.2
tensorflow 2.18.0
tensorflow-io 0.37.1
tensorflow-io-gcs-filesystem 0.37.1
termcolor 2.5.0
text-unidecode 1.3
tf_keras 2.18.0
tiktoken 0.8.0
timm 1.0.13
tokenizers 0.21.0
tomlkit 0.13.2
torch 2.1.0
torch-npu 2.1.0.post8
torchvision 0.16.0
tornado 6.4.1
tqdm 4.66.5
traitlets 5.14.3
transformers 4.49.0.dev0
transformers-stream-generator 0.0.5
trl 0.15.2
typeguard 4.4.1
typer 0.15.1
types-python-dateutil 2.9.0.20241003
typing_extensions 4.12.2
tyro 0.9.11
tzdata 2024.2
urllib3 2.2.2
uvicorn 0.34.0
wcwidth 0.2.13
websockets 14.1
Werkzeug 3.1.3
wheel 0.38.4
wrapt 1.17.2
xxhash 3.5.0
yarl 1.18.3
zipp 3.21.0
zstandard 0.23.0

ChingKwanCheung • Mar 07 '25 02:03

Training works when the ViT is frozen (--freeze_vit True); full-parameter fine-tuning raises the error above.
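As a hedged illustration of the workaround above, the sketch below reuses the flags from the original report and changes only `--freeze_vit` to `True`. The `train_list` default path and the `RUN` dry-run switch are assumptions added for illustration; they do not appear in the report.

```shell
#!/bin/sh
# Sketch only: flags copied from the original report, with the ViT frozen.
# train_list is a hypothetical placeholder; point it at your own dataset.
train_list="${train_list:-/path/to/train.jsonl}"

SWIFT_ARGS="--model XX/qwen25-vl-3B \
  --model_type qwen2_5_vl \
  --num_train_epochs 1 \
  --freeze_llm False \
  --freeze_vit True \
  --freeze_aligner False \
  --dataset $train_list \
  --max_pixels 1330000 \
  --max_length 4096 \
  --eval_steps 1000 \
  --eval_strategy no \
  --save_steps 300 \
  --save_total_limit 2 \
  --train_type full \
  --per_device_train_batch_size 1 \
  --learning_rate 1e-5 \
  --output_dir output \
  --deepspeed zero2 \
  --ddp_backend hccl \
  --truncation_strategy right \
  --torch_dtype float16 \
  --lora_dtype float16"

# Dry run by default: print the command. Set RUN=1 to launch training.
if [ "${RUN:-0}" = "1" ]; then
  # Unquoted on purpose so the flag string splits into separate arguments.
  swift sft $SWIFT_ARGS
else
  echo swift sft $SWIFT_ARGS
fi
```

With the ViT frozen, no weight gradients are computed for the vision tower, which is presumably why the failing Conv3DBackpropFilter operator from the traceback is never reached. The trade-off is that the vision encoder is no longer fine-tuned.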

ChingKwanCheung • Mar 07 '25 07:03

Hey, did you manage to solve this problem?

dashun0571 • Mar 19 '25 02:03

Same problem. Could this be caused by NPU operator support for ViT training (i.e., a missing operator)?

PatrickYangSG • May 02 '25 02:05

Try upgrading to the latest version and then setting --fp16 false --bf16 false.
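One possible reading of the suggestion above is to keep the full-parameter flags from the original report and append `--fp16 false --bf16 false`. The sketch below does that. The thread does not say whether `--torch_dtype` should stay `float16`, so it is left unchanged here as an assumption. The `train_list` default path and the `RUN` switch are likewise hypothetical.

```shell
#!/bin/sh
# Sketch only. Upgrade ms-swift first; `pip install -U ms-swift` is one way,
# assuming a PyPI install (the report used a source checkout instead).
# train_list is a hypothetical placeholder; point it at your own dataset.
train_list="${train_list:-/path/to/train.jsonl}"

SWIFT_ARGS="--model XX/qwen25-vl-3B \
  --model_type qwen2_5_vl \
  --num_train_epochs 1 \
  --freeze_llm False \
  --freeze_vit False \
  --freeze_aligner False \
  --dataset $train_list \
  --max_pixels 1330000 \
  --max_length 4096 \
  --eval_steps 1000 \
  --eval_strategy no \
  --save_steps 300 \
  --save_total_limit 2 \
  --train_type full \
  --per_device_train_batch_size 1 \
  --learning_rate 1e-5 \
  --output_dir output \
  --deepspeed zero2 \
  --ddp_backend hccl \
  --truncation_strategy right \
  --torch_dtype float16 \
  --lora_dtype float16 \
  --fp16 false \
  --bf16 false"

# Dry run by default: print the command. Set RUN=1 to launch training.
if [ "${RUN:-0}" = "1" ]; then
  # Unquoted on purpose so the flag string splits into separate arguments.
  swift sft $SWIFT_ARGS
else
  echo swift sft $SWIFT_ARGS
fi
```

Later replies in the thread report still hitting the error on recent versions, so this is a setting to try rather than a confirmed fix.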

Jintao-Huang • May 02 '25 02:05

Has this been resolved yet? I just ran into the same problem when training with swift.

JackeyGuo • Jun 09 '25 12:06

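Has this been resolved? I'm currently hitting the same problem, and my swift version is fairly recent too.

Training works when the ViT is frozen (--freeze_vit True); full-parameter fine-tuning raises the error above.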

zhaoyangwei123 • Aug 15 '25 07:08