Low GPU utilization when training Qwen3-VL-8B and 4B
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
- llamafactory version: 0.9.4.dev0
- Platform: Linux-6.14.0-28-generic-x86_64-with-glibc2.39
- Python version: 3.11.13
- PyTorch version: 2.9.0-rc9 (GPU)
- Transformers version: 4.57.1
- Datasets version: 4.0.0
- Accelerate version: 1.10.1
- PEFT version: 0.17.1
- GPU type: NVIDIA GeForce RTX 5090 D
- GPU number: 2
- GPU memory: 31.36GB
- TRL version: 0.9.6
- DeepSpeed version: 0.16.9
- Bitsandbytes version: 0.48.1
- Default data directory: detected
The llamafactory install was built from commit 1037f63.
Reproduction
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /app/models/Qwen3-VL-8B-Instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template qwen3_vl_nothink \
--flash_attn auto \
--dataset_dir data \
--dataset mllm_demo2 \
--cutoff_len 10240 \
--learning_rate 5e-05 \
--num_train_epochs 100.0 \
--max_samples 100000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--enable_thinking False \
--report_to none \
--output_dir saves/Qwen3-VL-8B-Instruct/lora/train_2025-10-16-10-19-04 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all \
--freeze_vision_tower False \
--freeze_multi_modal_projector False \
--image_max_pixels 960400 \
--image_min_pixels 1024 \
--video_max_pixels 65536 \
--video_min_pixels 256
When training the Qwen3-VL 8B or 4B models on two GPUs, GPU utilization stays low (no error is raised and training proceeds normally).
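For reference, a quick hedged way to put numbers on the "low utilization" symptom is to log per-GPU utilization once per second while training runs; the sketch below only assumes the nvidia-ml-py package (imported as pynvml), which already appears in the environments listed in this thread.

```python
# Hedged helper (not part of the original report): print per-GPU utilization once per
# second while training runs in another terminal, to quantify "low utilization".
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        print(time.strftime("%H:%M:%S"), " ".join(f"GPU{i}: {u:3d}%" for i, u in enumerate(utils)))
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```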
Others
In the same environment, training Qwen2.5-VL-7B with the following (nearly identical) command gives normal GPU utilization:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /app/models/Qwen2.5-VL-7B-Instruct \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template qwen2_vl \
--flash_attn auto \
--dataset_dir data \
--dataset mllm_demo2 \
--cutoff_len 10240 \
--learning_rate 5e-05 \
--num_train_epochs 100.0 \
--max_samples 100000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--enable_thinking False \
--report_to none \
--output_dir saves/Qwen2.5-VL-7B-Instruct/lora/train_2025-10-16-10-31-04 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all \
--freeze_vision_tower False \
--freeze_multi_modal_projector False \
--image_max_pixels 960400 \
--image_min_pixels 1024 \
--video_max_pixels 65536 \
--video_min_pixels 256
Same issue.
Is your training time normal? With the same settings, SFT on qwen3vl takes several times as long as qwen2.5vl for me.
Possibly an environment issue; I don't see this on a 4090, it runs at full utilization.
I'm seeing a similar issue; training time increased 2-3x.
Same problem here: with the same dataset and config, full-parameter SFT of qwen3vl-8b takes more than 3x as long as qwen2.5vl-7b.
@mengzmd @JunchenHuang777 Did you finish training and get final results? Any metric drop? After switching my task to Qwen3-VL, my evaluation metric dropped by almost 4 points :(
Could everyone share their hardware, environment setup, and dataset characteristics here so we can narrow down the problem? In our own testing we have not observed a particularly severe speed issue.
System Info: GPU: 2×A800, CUDA: 11.8, Python: 3.10
Package Version
absl-py 1.3.0 annoy 1.17.3 apex 0.1 appdirs 1.4.4 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 args 0.1.0 asttokens 2.2.1 astunparse 1.6.3 atari-py 0.2.9 attrs 22.1.0 audioread 3.0.0 backcall 0.2.0 beautifulsoup4 4.11.1 bleach 5.0.1 blis 0.7.9 box2d-py 2.3.8 cachetools 5.2.0 catalogue 2.0.8 certifi 2022.12.7 cffi 1.15.1 chardet 3.0.4 charset-normalizer 2.1.1 click 8.1.3 clint 0.5.1 cloudpickle 2.2.0 cmake 3.24.1.1 comm 0.2.1 confection 0.0.3 contourpy 1.0.6 cuda-python 11.7.0+0.g95a2041.dirty cudf 22.10.0a0+316.gad1ba132d2.dirty cugraph 22.10.0a0+113.g6bbdadf8.dirty cuml 22.10.0a0+56.g3a8dea659.dirty cupy-cuda111 12.3.0 cupy-cuda118 11.0.0 cycler 0.11.0 cymem 2.0.7 Cython 0.29.32 cytoolz 0.12.2 dask 2022.9.2 dask-cuda 22.10.0a0+23.g62a1ee8 dask-cudf 22.10.0a0+316.gad1ba132d2.dirty dbus-python 1.2.16 debugpy 1.6.4 decorator 5.1.1 defusedxml 0.7.1 distlib 0.3.8 distributed 2022.9.2 distro 1.4.0 dlib 19.24.2 entrypoints 0.4 exceptiongroup 1.0.4 execnet 1.9.0 executing 1.2.0 expecttest 0.1.3 fastjsonschema 2.16.2 fastrlock 0.8.1 filelock 3.13.1 fonttools 4.38.0 fsspec 2022.11.0 funcsigs 1.0.2 google-auth 2.15.0 google-auth-oauthlib 0.4.6 gpg 1.13.1 graphsurgeon 0.4.6 grpcio 1.51.1 gym 0.26.2 gym-notices 0.0.8 gym-retro 0.8.0 HeapDict 1.0.1 hypothesis 5.35.1 idna 3.4 imagecodecs 2023.3.16 importlib-metadata 5.1.0 importlib-resources 5.10.1 incremental 22.10.0 iniconfig 1.1.1 intel-openmp 2021.4.0 ipykernel 6.19.2 ipython 8.7.0 ipython-genutils 0.2.0 ipywidgets 8.1.1 jedi 0.18.2 jellyfish 1.0.3 Jinja2 3.1.2 joblib 1.2.0 json5 0.9.10 jsonschema 4.17.3 jupyter_client 7.4.8 jupyter_core 5.1.0 jupyter-tensorboard 0.2.0 jupyterlab 2.3.2 jupyterlab-pygments 0.2.2 jupyterlab-server 1.2.0 jupyterlab-widgets 3.0.9 jupytext 1.14.4 kaggle 1.5.16 kiwisolver 1.4.4 langcodes 3.3.0 librosa 0.9.2 llvmlite 0.39.1 locket 1.0.0 Markdown 3.4.1 markdown-it-py 2.1.0 MarkupSafe 2.1.1 marshmallow 3.20.1 matplotlib 3.6.2 matplotlib-inline 0.1.6 mdit-py-plugins 0.3.3 mdurl 0.1.2 menpo 0.11.0 mistune 2.0.4 mkl 2021.1.1 mkl-devel 2021.1.1 mkl-include 2021.1.1 mock 4.0.3 mpmath 1.2.1 msgpack 1.0.4 murmurhash 1.0.9 nbclient 0.7.2 nbconvert 7.2.6 nbformat 5.7.0 nest-asyncio 1.5.6 networkx 2.6.3 notebook 6.4.10 numba 0.56.4 numpy 1.22.2 nvidia-dali-cuda110 1.20.0 nvidia-pyindex 1.0.9 nvtx 0.2.5 oauthlib 3.2.2 onnx 1.12.0 opencv 4.6.0 packaging 22.0 pandas 1.5.2 pandocfilters 1.5.0 parso 0.8.3 partd 1.3.0 path 16.9.0 path.py 12.5.0 pathlib2 2.3.7.post1 pathy 0.10.1 pbr 6.0.0 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.2.0 pip 21.2.4 pkgutil_resolve_name 1.3.10 platformdirs 4.1.0 plotly 5.18.0 pluggy 1.0.0 polygraphy 0.43.1 pooch 1.6.0 preshed 3.0.8 prettytable 3.5.0 prometheus-client 0.15.0 prompt-toolkit 3.0.36 protobuf 3.20.1 psutil 5.9.4 ptyprocess 0.7.0 pure-eval 0.2.2 pyarrow 9.0.0 pyasn1 0.4.8 pyasn1-modules 0.2.8 pybind11 2.10.1 pycocotools 2.0+nv0.7.1 pycparser 2.21 pycrypto 2.6.1 pydantic 1.10.2 pyemd 1.0.0 pyglet 1.5.28 Pygments 2.13.0 PyGObject 3.36.0 pylibcugraph 22.10.0a0+113.g6bbdadf8.dirty pylibraft 22.10.0a0+81.g08abc72.dirty pynvml 11.4.1 pynvrtc 9.2 PyOpenGL 3.1.7 PyOpenGL-accelerate 3.1.7 pyparsing 3.0.9 pyphen 0.14.0 pyrsistent 0.19.2 pytest 7.2.0 pytest-rerunfailures 10.3 pytest-shard 0.1.2 pytest-xdist 3.1.0 python-dateutil 2.8.2 python-hostlist 1.22 python-slugify 8.0.1 pytorch-quantization 2.1.2 pytz 2022.6 PyYAML 6.0 pyzmq 24.0.1 qtconsole 5.5.1 QtPy 2.4.1 raft-dask 22.10.0a0+81.g08abc72.dirty raven 6.10.0 regex 2022.10.31 requests 2.28.1 requests-oauthlib 1.3.1 requests-toolbelt 1.0.0 resampy 0.4.2 
retrowrapper 0.3.0 rmm 22.10.0a0+38.ge043158.dirty rsa 4.9 scikit-learn 0.24.2 scipy 1.6.3 seaborn 0.13.1 Send2Trash 1.8.0 setuptools 59.5.0 six 1.16.0 smart-open 6.3.0 sortedcontainers 2.4.0 soundfile 0.11.0 soupsieve 2.3.2.post1 spacy 3.4.4 spacy-legacy 3.0.10 spacy-loggers 1.0.4 sphinx-glpi-theme 0.3 srsly 2.4.5 ssh-import-id 5.10 stack-data 0.6.2 sympy 1.11.1 tbb 2021.7.1 tblib 1.7.0 tenacity 8.2.3 tensorboard 2.9.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorrt 8.5.1.7 terminado 0.17.1 text-unidecode 1.3 thinc 8.1.5 threadpoolctl 3.1.0 tifffile 2023.7.10 tinycss2 1.2.1 toml 0.10.2 tomli 2.0.1 toolz 0.12.0 torch 1.14.0a0+410ce96 torch-tensorrt 1.3.0a0 torchtext 0.13.0a0+fae8e8c torchvision 0.15.0a0 tornado 6.4 tqdm 4.64.1 traitlets 5.7.1 transformer-engine 0.3.0 treelite 2.4.0 treelite-runtime 2.4.0 typer 0.7.0 typing_extensions 4.4.0 ucx-py 0.27.0a0+29.ge9e81f8 uff 0.6.9 urllib3 1.26.13 virtualenv 20.25.0 visdom 0.2.4 wasabi 0.10.1 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 1.7.0 Werkzeug 2.2.2 wheel 0.38.4 widgetsnbextension 4.0.9 xdoctest 1.0.2 xgboost 1.6.2 zict 2.2.0 zipp 3.11.0 zmq 0.0.0 训练参数
### model
model_name_or_path: /dfs/data/qwen3_vl_8b
image_max_pixels: 1048576

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: false
freeze_multi_modal_projector: false
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: lingbujian_fuza_prompt
template: qwen3_vl
cutoff_len: 2048
max_samples: 30000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen3vl/full/sft
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 2000
eval_dataset: lingbujian_val
compute_accuracy: true

The training arguments are identical to those used for qwen2.5_vl. The dataset is an image-classification task (the answer is the class label), with 24k training and 7k validation samples. With the same data and training arguments, full-parameter fine-tuning of qwen2.5_vl takes roughly 160 h.
@Kuangdd01 The parameter that controls image size in Qwen3-VL seems to differ from Qwen2.5-VL: it changed from IMAGE_MAX_PIXELS to IMAGE_MAX_TOKEN_NUM, see https://github.com/QwenLM/Qwen3-VL/blob/6d08b04928bd3914b353f833dfe71de83989dfb9/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L26-L29. Has this been adapted for? My initial suspicion is that this is the cause.
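One hedged way to test that suspicion without touching LLaMA-Factory is to compare how many vision tokens each model's Hugging Face processor emits for the same image. The sketch below assumes both image processors follow the Qwen2/2.5-VL interface (returning image_grid_thw and exposing a merge_size attribute); the model paths are taken from the commands above, and sample.jpg is a hypothetical placeholder for a real training image.

```python
# Hedged sketch: count the vision tokens each processor produces for one image.
# Assumes the Qwen2/2.5-VL-style image processor interface (image_grid_thw output
# and a merge_size attribute); "sample.jpg" is a hypothetical placeholder image.
from PIL import Image
from transformers import AutoProcessor

def vision_tokens(model_path: str, image_path: str) -> int:
    processor = AutoProcessor.from_pretrained(model_path)
    image = Image.open(image_path).convert("RGB")
    out = processor.image_processor(images=[image], return_tensors="pt")
    t, h, w = out["image_grid_thw"][0].tolist()   # patch grid (temporal, height, width)
    merge = processor.image_processor.merge_size  # patches merged per token side
    return (t * h * w) // (merge * merge)

# Paths taken from the reproduction commands above.
for path in ["/app/models/Qwen2.5-VL-7B-Instruct", "/app/models/Qwen3-VL-8B-Instruct"]:
    print(path, vision_tokens(path, "sample.jpg"))
```

If the Qwen3-VL processor ends up with far more vision tokens at the same image_max_pixels setting, longer step times would be expected rather than a trainer bug.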
System Info gpu:8×H100 cuda:12.3 python3.10
Package Version
accelerate 1.10.1 aiofiles 24.1.0 aiohappyeyeballs 2.6.1 aiohttp 3.13.1 aiosignal 1.4.0 annotated-types 0.7.0 antlr4-python3-runtime 4.9.3 anyio 4.11.0 async-timeout 5.0.1 attrs 25.4.0 audioread 3.0.1 av 16.0.1 Brotli 1.1.0 certifi 2025.10.5 cffi 2.0.0 charset-normalizer 3.4.4 click 8.3.0 contourpy 1.3.2 cycler 0.12.1 datasets 4.0.0 decorator 5.2.1 deepspeed 0.16.9 dill 0.3.8 docstring_parser 0.17.0 einops 0.8.1 exceptiongroup 1.3.0 fastapi 0.119.0 ffmpy 0.6.3 filelock 3.20.0 fire 0.7.1 fonttools 4.60.1 frozenlist 1.8.0 fsspec 2025.3.0 gradio 5.45.0 gradio_client 1.13.0 groovy 0.1.2 h11 0.16.0 hf_transfer 0.1.9 hf-xet 1.1.10 hjson 3.1.0 httpcore 1.0.9 httpx 0.28.1 huggingface-hub 0.35.3 idna 3.11 jieba 0.42.1 Jinja2 3.1.6 joblib 1.5.2 kiwisolver 1.4.9 lazy_loader 0.4 librosa 0.11.0 llamafactory 0.9.4.dev0 llvmlite 0.45.1 markdown-it-py 4.0.0 MarkupSafe 3.0.3 matplotlib 3.10.7 mdurl 0.1.2 modelscope 1.31.0 mpmath 1.3.0 msgpack 1.1.2 multidict 6.7.0 multiprocess 0.70.16 networkx 3.4.2 ninja 1.13.0 nltk 3.9.2 numba 0.62.1 numpy 1.26.4 nvidia-cublas-cu12 12.8.4.1 nvidia-cuda-cupti-cu12 12.8.90 nvidia-cuda-nvrtc-cu12 12.8.93 nvidia-cuda-runtime-cu12 12.8.90 nvidia-cudnn-cu12 9.10.2.21 nvidia-cufft-cu12 11.3.3.83 nvidia-cufile-cu12 1.13.1.3 nvidia-curand-cu12 10.3.9.90 nvidia-cusolver-cu12 11.7.3.90 nvidia-cusparse-cu12 12.5.8.93 nvidia-cusparselt-cu12 0.7.1 nvidia-ml-py 13.580.82 nvidia-nccl-cu12 2.27.5 nvidia-nvjitlink-cu12 12.8.93 nvidia-nvshmem-cu12 3.3.20 nvidia-nvtx-cu12 12.8.90 omegaconf 2.3.0 orjson 3.11.3 packaging 25.0 pandas 2.3.3 peft 0.17.1 pillow 11.3.0 pip 24.0 platformdirs 4.5.0 pooch 1.8.2 propcache 0.4.1 protobuf 6.33.0 psutil 7.1.0 py-cpuinfo 9.0.0 pyarrow 21.0.0 pycparser 2.23 pydantic 2.10.6 pydantic_core 2.27.2 pydub 0.25.1 Pygments 2.19.2 pyparsing 3.2.5 python-dateutil 2.9.0.post0 python-multipart 0.0.20 pytz 2025.2 PyYAML 6.0.3 regex 2025.9.18 requests 2.32.5 rich 14.2.0 rouge-chinese 1.0.3 ruff 0.14.1 safehttpx 0.1.6 safetensors 0.5.3 scikit-learn 1.7.2 scipy 1.15.3 semantic-version 2.10.0 sentencepiece 0.2.1 setuptools 69.1.0 shellingham 1.5.4 shtab 1.7.2 six 1.17.0 sniffio 1.3.1 soundfile 0.13.1 soxr 1.0.0 sse-starlette 3.0.2 starlette 0.48.0 sympy 1.14.0 termcolor 3.1.0 threadpoolctl 3.6.0 tiktoken 0.12.0 tokenizers 0.22.1 tomlkit 0.13.3 torch 2.9.0 torchvision 0.24.0 tqdm 4.67.1 transformers 4.57.1 triton 3.5.0 trl 0.9.6 typer 0.19.2 typing_extensions 4.15.0 tyro 0.8.14 tzdata 2025.2 urllib3 2.5.0 uvicorn 0.38.0 websockets 15.0.1 wheel 0.42.0 xxhash 3.6.0 yarl 1.22.0
Training arguments
### model
model_name_or_path: ./qwen3-vl-8b
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: video_test
template: qwen3_vl
cutoff_len: 4096
overwrite_cache: false
preprocessing_num_workers: 8
dataloader_num_workers: 8
tokenized_path: ./cache/video_test_tokenized

### output
output_dir: ./sft_test
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: epoch

The dataset is all videos, 52k samples. With the same data and config, qwen3vl-8b training took 119 h (excluding the video preprocessing and encoding stage), while 2.5vl-7b took about 16 h.
System Info gpu: 8xH20 cuda 12.2 python 3.10
pip packages: Package Version Editable project location
accelerate 1.10.1 aiofiles 24.1.0 aiohappyeyeballs 2.6.1 aiohttp 3.13.0 aiosignal 1.4.0 annotated-types 0.7.0 antlr4-python3-runtime 4.9.3 anyio 4.11.0 async-timeout 5.0.1 attrs 25.4.0 audioread 3.0.1 av 16.0.1 boto3 1.40.55 botocore 1.40.55 Brotli 1.1.0 certifi 2025.10.5 cffi 2.0.0 charset-normalizer 3.4.4 click 8.3.0 contourpy 1.3.2 cycler 0.12.1 datasets 4.0.0 decorator 5.2.1 dill 0.3.8 docstring_parser 0.17.0 einops 0.8.1 exceptiongroup 1.3.0 fastapi 0.119.0 ffmpy 0.6.3 filelock 3.20.0 fire 0.7.1 fonttools 4.60.1 frozenlist 1.8.0 fsspec 2025.3.0 gradio 5.45.0 gradio_client 1.13.0 groovy 0.1.2 h11 0.16.0 hf_transfer 0.1.9 hf-xet 1.1.10 httpcore 1.0.9 httpx 0.28.1 huggingface-hub 0.35.3 idna 3.11 jieba 0.42.1 Jinja2 3.1.6 jmespath 1.0.1 joblib 1.5.2 kiwisolver 1.4.9 lazy_loader 0.4 librosa 0.11.0 llamafactory 0.9.4.dev0 /home/work/workspace/llama_venv/LLaMA-Factory llvmlite 0.45.1 markdown-it-py 4.0.0 MarkupSafe 3.0.3 matplotlib 3.10.7 mdurl 0.1.2 modelscope 1.31.0 mpmath 1.3.0 msgpack 1.1.2 multidict 6.7.0 multiprocess 0.70.16 networkx 3.4.2 nltk 3.9.2 numba 0.62.1 numpy 1.26.4 nvidia-cublas-cu12 12.8.4.1 nvidia-cuda-cupti-cu12 12.8.90 nvidia-cuda-nvrtc-cu12 12.8.93 nvidia-cuda-runtime-cu12 12.8.90 nvidia-cudnn-cu12 9.10.2.21 nvidia-cufft-cu12 11.3.3.83 nvidia-cufile-cu12 1.13.1.3 nvidia-curand-cu12 10.3.9.90 nvidia-cusolver-cu12 11.7.3.90 nvidia-cusparse-cu12 12.5.8.93 nvidia-cusparselt-cu12 0.7.1 nvidia-ml-py 13.580.82 nvidia-nccl-cu12 2.27.5 nvidia-nvjitlink-cu12 12.8.93 nvidia-nvshmem-cu12 3.3.20 nvidia-nvtx-cu12 12.8.90 omegaconf 2.3.0 orjson 3.11.3 packaging 25.0 pandas 2.3.3 peft 0.17.1 pillow 11.3.0 pip 25.2 platformdirs 4.5.0 pooch 1.8.2 prettytable 3.16.0 propcache 0.4.1 protobuf 6.33.0 psutil 7.1.0 pyarrow 21.0.0 pycparser 2.23 pydantic 2.10.6 pydantic_core 2.27.2 pydub 0.25.1 pyecharts 2.0.9 Pygments 2.19.2 pyparsing 3.2.5 python-dateutil 2.9.0.post0 python-multipart 0.0.20 pytz 2025.2 PyYAML 6.0.3 regex 2025.9.18 requests 2.32.5 rich 13.9.4 rouge-chinese 1.0.3 ruff 0.14.1 s3transfer 0.14.0 safehttpx 0.1.6 safetensors 0.5.3 scikit-learn 1.7.2 scipy 1.15.3 semantic-version 2.10.0 sentencepiece 0.2.1 setuptools 80.9.0 shellingham 1.5.4 shtab 1.7.2 simplejson 3.20.2 six 1.17.0 sniffio 1.3.1 soundfile 0.13.1 soxr 1.0.0 sse-starlette 3.0.2 starlette 0.48.0 swanlab 0.6.12 sympy 1.14.0 termcolor 3.1.0 threadpoolctl 3.6.0 tiktoken 0.12.0 tokenizers 0.22.1 tomlkit 0.13.3 torch 2.9.0 torchvision 0.24.0 tqdm 4.67.1 transformers 4.57.1 triton 3.5.0 trl 0.9.6 typer 0.19.2 typing_extensions 4.15.0 tyro 0.8.14 tzdata 2025.2 urllib3 2.5.0 uvicorn 0.37.0 wcwidth 0.2.14 websockets 15.0.1 wrapt 2.0.0 xxhash 3.6.0 yarl 1.22.0
Training arguments:
model_name_or_path: /home/work/bos/Qwen3-VL-8B-Instruct
image_max_pixels: 853000  # 1000 × 853
video_max_pixels: 16384

stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

dataset: poi_scene_pano_v1_train  # video: mllm_video_demo
template: qwen3_vl
cutoff_len: 4096
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
eval_dataset: poi_scene_pano_v1_val

do_eval: true
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100
predict_with_generate: true

output_dir: saves/qwen3_vl-8b/lora/sft_v1
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false

per_device_train_batch_size: 4
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

use_swanlab: true
swanlab_project: llamafactory_qwen3vl_8b
swanlab_run_name: poi_scene_pano_lora_v1

Training time: 10k samples took 79 hours; 2.5vl-7b takes about 2 h.
With the same environment and launch method, why do I get this error? Did you hit it, and how did you solve it? Could you take a look? ValueError: Processor was not found, please check and update your model file.
系统: 4卡4090 24G NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 python 3.11 环境: absl-py==2.3.1 accelerate==1.11.0 aiofiles==24.1.0 aiohappyeyeballs==2.6.1 aiohttp==3.13.1 aiosignal==1.4.0 annotated-types==0.7.0 antlr4-python3-runtime==4.9.3 anyio==4.11.0 attrs==25.4.0 audioread==3.0.1 av==16.0.1 Brotli==1.1.0 certifi==2025.10.5 cffi==2.0.0 charset-normalizer==3.4.4 click==8.3.0 contourpy==1.3.3 cycler==0.12.1 datasets==4.0.0 decorator==5.2.1 deepspeed==0.16.9 dill==0.3.8 docstring_parser==0.17.0 einops==0.8.1 fastapi==0.119.1 ffmpy==0.6.3 filelock==3.20.0 fire==0.7.1 flash_attn==2.8.3 fonttools==4.60.1 frozenlist==1.8.0 fsspec==2025.3.0 gradio==5.45.0 gradio_client==1.13.0 groovy==0.1.2 grpcio==1.76.0 h11==0.16.0 hf-xet==1.1.10 hf_transfer==0.1.9 hjson==3.1.0 httpcore==1.0.9 httpx==0.28.1 huggingface-hub==0.35.3 idna==3.11 jieba==0.42.1 Jinja2==3.1.6 joblib==1.5.2 kiwisolver==1.4.9 lazy_loader==0.4 librosa==0.11.0 Editable install with no version control (llamafactory==0.9.4.dev0) -e /path_to/LLaMA-Factory llvmlite==0.45.1 Markdown==3.9 markdown-it-py==4.0.0 MarkupSafe==3.0.3 matplotlib==3.10.7 mdurl==0.1.2 modelscope==1.31.0 mpmath==1.3.0 msgpack==1.1.2 multidict==6.7.0 multiprocess==0.70.16 networkx==3.5 ninja==1.13.0 nltk==3.9.2 numba==0.62.1 numpy==1.26.4 nvidia-cublas-cu12==12.8.4.1 nvidia-cuda-cupti-cu12==12.8.90 nvidia-cuda-nvrtc-cu12==12.8.93 nvidia-cuda-runtime-cu12==12.8.90 nvidia-cudnn-cu12==9.10.2.21 nvidia-cufft-cu12==11.3.3.83 nvidia-cufile-cu12==1.13.1.3 nvidia-curand-cu12==10.3.9.90 nvidia-cusolver-cu12==11.7.3.90 nvidia-cusparse-cu12==12.5.8.93 nvidia-cusparselt-cu12==0.7.1 nvidia-ml-py==13.580.82 nvidia-nccl-cu12==2.27.5 nvidia-nvjitlink-cu12==12.8.93 nvidia-nvshmem-cu12==3.3.20 nvidia-nvtx-cu12==12.8.90 omegaconf==2.3.0 orjson==3.11.3 packaging==25.0 pandas==2.3.3 peft==0.17.1 pillow==11.3.0 platformdirs==4.5.0 pooch==1.8.2 propcache==0.4.1 protobuf==6.33.0 psutil==7.1.1 py-cpuinfo==9.0.0 pyarrow==21.0.0 pycparser==2.23 pydantic==2.10.6 pydantic_core==2.27.2 pydub==0.25.1 Pygments==2.19.2 pyparsing==3.2.5 python-dateutil==2.9.0.post0 python-multipart==0.0.20 pytz==2025.2 PyYAML==6.0.3 regex==2025.10.22 requests==2.32.5 rich==14.2.0 rouge-chinese==1.0.3 ruff==0.14.1 safehttpx==0.1.6 safetensors==0.5.3 scikit-learn==1.7.2 scipy==1.16.2 semantic-version==2.10.0 sentencepiece==0.2.1 shellingham==1.5.4 shtab==1.7.2 six==1.17.0 sniffio==1.3.1 soundfile==0.13.1 soxr==1.0.0 sse-starlette==3.0.2 starlette==0.48.0 sympy==1.14.0 tensorboard==2.20.0 tensorboard-data-server==0.7.2 termcolor==3.1.0 threadpoolctl==3.6.0 tiktoken==0.12.0 tokenizers==0.22.1 tomlkit==0.13.3 torch==2.9.0 torchvision==0.24.0 tqdm==4.67.1 transformers==4.57.1 triton==3.5.0 trl==0.9.6 typer==0.20.0 typing_extensions==4.15.0 tyro==0.8.14 tzdata==2025.2 urllib3==2.5.0 uvicorn==0.38.0 websockets==15.0.1 Werkzeug==3.1.3 xxhash==3.6.0 yarl==1.22.0 运行脚本 model_name_or_path: /hdd/wangty/model/Qwen2.5-VL-3B-Instruct trust_remote_code: true image_max_pixels: 262144 video_max_pixels: 16384
stage: sft
deepspeed: /hdd/wangty/new_task/LLaMA-Factory/examples/deepspeed/ds_z2_config.json
do_train: true
finetuning_type: lora
lora_rank: 128
lora_alpha: 256
lora_dropout: 0.05
lora_target: all
freeze_multi_modal_projector: false
freeze_vision_tower: false

dataset_dir: /hdd/wangty/new_task/LLaMA-Factory/task/dataset/qwen/zyzg
dataset: zyzg_sag+tra_crop_train
template: qwen2_vl
cutoff_len: 2048
max_samples: 50000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 8
dataloader_pin_memory: true
dataloader_persistent_workers: false

output_dir: /hdd/wangty/new_task/LLaMA-Factory/task/work_dirs/qwen2.5vl_3b/zyzg/test_gpu
logging_steps: 10
save_strategy: "best"
load_best_model_at_end: true
metric_for_best_model: 'eval_zyzg_sag+tra_crop_val_accuracy'
save_total_limit: 1
overwrite_output_dir: true
report_to: tensorboard

per_device_train_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 5.0e-5
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
weight_decay: 0.0
bf16: true
gradient_checkpointing: true
ddp_timeout: 180000000

eval_dataset: zyzg_sag+tra_crop_val
per_device_eval_batch_size: 4
eval_strategy: epoch
do_sample: false
compute_accuracy: true
The qwen3 run only swaps the model and template; every other setting is identical. Training time (the training images are very low-resolution): qwen3-vl-4B: 30 h with GPU utilization around 30%; qwen2.5-vl-3B: 1 h 16 min with normal GPU utilization.
With the same environment and launch method, why do I get this error? Did you hit it, and how did you solve it? Could you take a look? ValueError: Processor was not found, please check and update your model file.
Your transformers version is outdated.
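A quick, hedged way to check whether the problem is the transformers install or the checkpoint files (the path below is a placeholder for the local model directory):

```python
# Hedged sketch: verify the installed transformers can resolve the Qwen3-VL processor.
# "/path/to/Qwen3-VL-8B-Instruct" is a placeholder for the local checkpoint directory.
import transformers
from transformers import AutoConfig, AutoProcessor

print("transformers", transformers.__version__)

config = AutoConfig.from_pretrained("/path/to/Qwen3-VL-8B-Instruct")
print("model_type:", config.model_type)  # expected to be a qwen3_vl* type on transformers >= 4.57

processor = AutoProcessor.from_pretrained("/path/to/Qwen3-VL-8B-Instruct")
print("processor:", type(processor).__name__)
```

If this fails outside LLaMA-Factory as well, the cause is the checkpoint's processor files or the transformers install rather than the trainer.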
requests 2.32.5 rich 14.2.0 ruff 0.14.1 safehttpx 0.1.6 safetensors 0.5.3 scikit-learn 1.7.2 scipy 1.16.2 semantic-version 2.10.0 sentencepiece 0.2.1 setuptools 80.9.0 shellingham 1.5.4 shtab 1.7.2 six 1.17.0 sniffio 1.3.1 soundfile 0.13.1 soxr 1.0.0 sse-starlette 3.0.2 starlette 0.48.0 sympy 1.14.0 termcolor 3.1.0 threadpoolctl 3.6.0 tiktoken 0.12.0 tokenizers 0.22.1 tomlkit 0.13.3 torch 2.9.0 tqdm 4.67.1 transformers 4.57.1 triton 3.5.0 trl 0.9.6 typer 0.19.2 typing_extensions 4.15.0 tyro 0.8.14 tzdata 2025.2 urllib3 2.5.0 uvicorn 0.38.0 websockets 15.0.1 wheel 0.45.1 xxhash 3.6.0 yarl 1.22.0
But my transformers is already updated to 4.57.1, and my LLaMA-Factory is freshly pulled from the latest code.
Same issue, have you solved it?
### model
model_name_or_path: /root/liguochun/models/Qwen/Qwen3-VL-8B-Instruct
image_max_pixels: 589824
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: click_0904_0916_train
template: qwen3_vl
cutoff_len: 4096
max_samples: 4000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/Qwen3-VL-8B-Instruct/LR1.0e-4_Rank8_epoch20_batch64
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
pure_bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

use_swanlab: true
swanlab_run_name: Qwen3-VL-8B-Instruct/LR1.0e-4_Rank8_epoch20_batch64
# optional
swanlab_api_key: xxxx
Package Version Editable project location
accelerate 1.11.0 aiofiles 24.1.0 aiohappyeyeballs 2.6.1 aiohttp 3.13.1 aiosignal 1.4.0 annotated-types 0.7.0 antlr4-python3-runtime 4.9.3 anyio 4.11.0 async-timeout 5.0.1 attrs 25.4.0 audioread 3.0.1 av 16.0.1 boto3 1.40.55 botocore 1.40.55 Brotli 1.1.0 certifi 2025.10.5 cffi 2.0.0 charset-normalizer 3.4.4 click 8.3.0 contourpy 1.3.2 cycler 0.12.1 datasets 4.0.0 decorator 5.2.1 dill 0.3.8 docstring_parser 0.17.0 einops 0.8.1 exceptiongroup 1.3.0 fastapi 0.119.1 ffmpy 0.6.3 filelock 3.20.0 fire 0.7.1 fonttools 4.60.1 frozenlist 1.8.0 fsspec 2025.3.0 gradio 5.45.0 gradio_client 1.13.0 groovy 0.1.2 h11 0.16.0 hf_transfer 0.1.9 hf-xet 1.1.10 httpcore 1.0.9 httpx 0.28.1 huggingface-hub 0.35.3 idna 3.11 jieba 0.42.1 Jinja2 3.1.6 jmespath 1.0.1 joblib 1.5.2 kiwisolver 1.4.9 lazy_loader 0.4 librosa 0.11.0 llamafactory 0.9.4.dev0 /root/liguochun/LLaMA-Factory-main llvmlite 0.45.1 markdown-it-py 4.0.0 MarkupSafe 3.0.3 matplotlib 3.10.7 mdurl 0.1.2 modelscope 1.31.0 mpmath 1.3.0 msgpack 1.1.2 multidict 6.7.0 multiprocess 0.70.16 networkx 3.4.2 nltk 3.9.2 numba 0.62.1 numpy 1.26.4 nvidia-cublas-cu12 12.8.4.1 nvidia-cuda-cupti-cu12 12.8.90 nvidia-cuda-nvrtc-cu12 12.8.93 nvidia-cuda-runtime-cu12 12.8.90 nvidia-cudnn-cu12 9.10.2.21 nvidia-cufft-cu12 11.3.3.83 nvidia-cufile-cu12 1.13.1.3 nvidia-curand-cu12 10.3.9.90 nvidia-cusolver-cu12 11.7.3.90 nvidia-cusparse-cu12 12.5.8.93 nvidia-cusparselt-cu12 0.7.1 nvidia-ml-py 13.580.82 nvidia-nccl-cu12 2.27.5 nvidia-nvjitlink-cu12 12.8.93 nvidia-nvshmem-cu12 3.3.20 nvidia-nvtx-cu12 12.8.90 nvitop 1.5.3 omegaconf 2.3.0 orjson 3.11.3 packaging 25.0 pandas 2.3.3 peft 0.17.1 pillow 11.3.0 pip 25.2 platformdirs 4.5.0 pooch 1.8.2 prettytable 3.16.0 propcache 0.4.1 protobuf 6.33.0 psutil 7.1.1 pyarrow 21.0.0 pycparser 2.23 pydantic 2.10.6 pydantic_core 2.27.2 pydub 0.25.1 pyecharts 2.0.9 Pygments 2.19.2 pyparsing 3.2.5 python-dateutil 2.9.0.post0 python-multipart 0.0.20 pytz 2025.2 PyYAML 6.0.3 regex 2025.10.22 requests 2.32.5 rich 13.9.4 rouge-chinese 1.0.3 ruff 0.14.1 s3transfer 0.14.0 safehttpx 0.1.6 safetensors 0.5.3 scikit-learn 1.7.2 scipy 1.15.3 semantic-version 2.10.0 sentencepiece 0.2.1 setuptools 80.9.0 shellingham 1.5.4 shtab 1.7.2 simplejson 3.20.2 six 1.17.0 sniffio 1.3.1 soundfile 0.13.1 soxr 1.0.0 sse-starlette 3.0.2 starlette 0.48.0 swanlab 0.6.12 sympy 1.14.0 termcolor 3.1.0 threadpoolctl 3.6.0 tiktoken 0.12.0 tokenizers 0.22.1 tomlkit 0.13.3 torch 2.9.0 torchvision 0.24.0 tqdm 4.67.1 transformers 4.57.1 triton 3.5.0 trl 0.9.6 typer 0.20.0 typing_extensions 4.15.0 tyro 0.8.14 tzdata 2025.2 urllib3 2.5.0 uvicorn 0.38.0 wcwidth 0.2.14 websockets 15.0.1 wheel 0.45.1 wrapt 2.0.0 xxhash 3.6.0 yarl 1.22.0
Same problem here: 2× A100 40G, DPO with LoRA on 1000 samples. Qwen2.5-VL-7B takes about ten minutes, Qwen3-VL-8B needs four hours.
(Quoting the 2×A800 full-parameter SFT environment and training arguments posted above.)
After about 9 hours the training stalled: the progress stopped updating and one of the GPUs sat at 0% utilization.
(Replying to the pip list quoted above, which does not include torchvision.)
You're missing torchvision; install it and try again.
Installing llama-factory's requirements gave me the same very slow training. After reinstalling the official QwenVL environment, training is much faster and GPU utilization stays above 90%.
Installing llama-factory's requirements gave me the same very slow training. After reinstalling the official QwenVL environment, training is much faster and GPU utilization stays above 90%.
Do you mean Qwen's official finetune code?
Installing llama-factory's requirements gave me the same very slow training. After reinstalling the official QwenVL environment, training is much faster and GPU utilization stays above 90%.
Could you share exactly what you did? Thanks a lot!
The same slow training problem occurred when installing llama-factory's requirements. After reinstalling the QwenVL official environment, the training speed was much faster and the GPU utilization rate was over 90%.
Could you please tell me how to do it? Thank you very much!
@sean-xr He referred to this (https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-finetune) I suppose, it helped me too.
The same slow training problem occurred when installing llama-factory's requirements. After reinstalling the QwenVL official environment, the training speed was much faster and the GPU utilization rate was over 90%.
Could you please tell me how to do it? Thank you very much!
@sean-xr He referred to this (https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-finetune) I suppose, it helped me too.
Thanks!
The same slow training problem occurred when installing llama-factory's requirements. After reinstalling the QwenVL official environment, the training speed was much faster and the GPU utilization rate was over 90%.
Could you please tell me how to do it? Thank you very much!
@sean-xr He referred to this (https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-finetune) I suppose, it helped me too.
So you trained with that framework instead of LLaMA-Factory?
The same slow training problem occurred when installing llama-factory's requirements. After reinstalling the QwenVL official environment, the training speed was much faster and the GPU utilization rate was over 90%.
Could you please tell me how to do it? Thank you very much!
@sean-xr He referred to this (https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-finetune) I suppose, it helped me too.
So you trained with that framework instead of LLaMA-Factory?
No, you can still use LLaMA-Factory.
What I did is very simple: in a new conda environment, first install the dependencies suggested by Qwen3-VL:
torch==2.6.0
torchvision==0.21.0
transformers==4.57.0.dev0
deepspeed==0.17.1
flash_attn==2.7.4.post1
triton==3.2.0
accelerate==1.7.0
torchcodec==0.2
peft==0.17.1
(Note: I did not install flash_attn and deepspeed, and used transformers==4.57.1)
Then you could proceed and install Llama-factory's dependencies by:
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
This works for me and solves my training speed issue for Qwen3VL.
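For anyone comparing their setup against this working combination, here is a small hedged helper that prints the locally installed versions of the packages mentioned in this thread (package names taken from the lists above; nothing here is specific to LLaMA-Factory):

```python
# Hedged helper: print the installed versions of the packages that differ between
# the slow and fast setups reported in this thread.
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["torch", "torchvision", "torchcodec", "transformers", "accelerate",
            "deepspeed", "flash-attn", "triton", "peft", "trl", "llamafactory"]

for name in PACKAGES:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```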
The same slow training problem occurred when installing llama-factory's requirements. After reinstalling the QwenVL official environment, the training speed was much faster and the GPU utilization rate was over 90%.
Could you please tell me how to do it? Thank you very much!
@sean-xr He referred to this (https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-finetune) I suppose, it helped me too.
大佬,您是使用那个框架进行的训练吗而不是使用LLama factory了吗?
No, you can still use LLaMA-Factory.
What I did is very simple: in a new conda environment, first install the dependencies suggested by Qwen3-VL:
torch==2.6.0 torchvision==0.21.0 transformers==4.57.0.dev0 deepspeed==0.17.1 flash_attn==2.7.4.post1 triton==3.2.0 accelerate==1.7.0 torchcodec==0.2 peft==0.17.1 (Note: I did not install flash_attn and deepspeed, and used transformers==4.57.1)
Then you could proceed and install Llama-factory's dependencies by:
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
This works for me and solves my training speed issue for Qwen3VL.
Good! This also works for me for training speed.
OK! Thank you!
Using the dependency versions from the official Qwen3-VL qwen-vl-finetune repo seems to work well for pre-Blackwell GPUs. For Blackwell GPUs, however, the minimum supported PyTorch is 2.7.1, so I downgraded torch from 2.9.0 to 2.7.1 and torchvision to 0.22.1, keeping the other dependencies at the versions used by qwen-vl-finetune, but the problem persists.