
support qwen2-vl with turbomind backend

Open irexyc opened this issue 1 year ago • 20 comments

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily receiving feedbacks. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories? If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

irexyc avatar Nov 06 '24 03:11 irexyc

Postpone the review until @irexyc refactors tm's attention module. cc @AllentDan @lzhangzz

lvhan028 avatar Nov 08 '24 06:11 lvhan028

Any updates?

serser avatar Dec 27 '24 04:12 serser

Besides the error in `p1 = p2 = p3 = (int)timestep - mrope_position_delta_`, the current branch produces incorrect results during batch inference. @irexyc

chenzhengda avatar Jan 20 '25 09:01 chenzhengda

Any updates?

randomseed713 avatar Feb 19 '25 09:02 randomseed713

A PR based on the current code will be submitted this week

irexyc avatar Feb 19 '25 09:02 irexyc

A PR based on the current code will be submitted this week

Will it also support Qwen2.5-VL?

Juniper1021 avatar Feb 20 '25 06:02 Juniper1021

A PR based on the current code will be submitted this week

Will it also support Qwen2.5-VL?

As I saw in the modification of lmdeploy/turbomind/supported_models.py, the Qwen2_5_VLForConditionalGeneration architecture will not be supported :/

piotr-sikora-v avatar Feb 20 '25 09:02 piotr-sikora-v

Qwen2.5-VL will be supported by the PyTorch engine.

lvhan028 avatar Feb 20 '25 14:02 lvhan028

waiting for demo of inference with qwen2-vl with turbomind backend

quanfeifan avatar Feb 23 '25 08:02 quanfeifan

waiting for demo of inference with qwen2-vl with turbomind backend

What's more, is there any plan to support qwen2-vl quantized with AWQ (W4A16) on turbomind?

quanfeifan avatar Feb 23 '25 13:02 quanfeifan

@xiaoxiangshusheng @chenzhengda Thanks for pointing out the bug. It should be addition, and the mrope timestep calculation also seems wrong. I hope the new PR fixes it.
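
For context, here is a minimal Python sketch (illustrative only, not lmdeploy code) of how the Hugging Face reference implementation of Qwen2-VL derives decode-time M-RoPE positions, which is why the delta has to be added to the timestep rather than subtracted; the function name and example numbers are made up for illustration.

```python
import torch

# mrope_position_delta is (max position id used in the multimodal prompt) + 1
# minus the prompt length, so each decode step ADDS it to the timestep.
def decode_position_ids(timestep: int, mrope_position_delta: int) -> torch.Tensor:
    # Past the prompt, the temporal, height and width rotary sections
    # all share the same position id.
    p = timestep + mrope_position_delta  # addition, not subtraction
    return torch.tensor([p, p, p])

# Example: a 2028-token prompt whose largest multimodal position id was 1500
# gives a delta of 1501 - 2028 = -527, so the first generated token
# (timestep 2028) gets position 2028 + (-527) = 1501.
print(decode_position_ids(2028, -527))  # tensor([1501, 1501, 1501])
```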

For image input, there is no difference between qwen2_5-vl and qwen2-vl. Therefore I added some mapping to support it.

This branch supports qwen2_5-vl inference with turbomind backend and quantization with lmdeploy lite api. This branch is developed based on another branch, so it may not be merged quickly. You can try this release. @randomseed713 @Juniper1021 @piotr-sikora-v @quanfeifan

irexyc avatar Feb 25 '25 04:02 irexyc

@xiaoxiangshusheng @chenzhengda Thanks for pointing out the bug. It should be addition, and the mrope timestep calculation also seems wrong. I hope the new PR fixes it.

For image input, there is no difference between qwen2_5-vl and qwen2-vl. Therefore I added some mapping to support it.

This branch supports qwen2_5-vl inference with turbomind backend and quantization with lmdeploy lite api. This branch is developed based on another branch, so it may not be merged quickly. You can try this release. @randomseed713 @Juniper1021 @piotr-sikora-v @quanfeifan

Thanks for sharing!

I found this error:

2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi flush_l1d
Virtualization:                       VT-x
L1d cache:                            896 KiB (28 instances)
L1i cache:                            896 KiB (28 instances)
L2 cache:                             7 MiB (28 instances)
L3 cache:                             70 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-13,28-41
NUMA node1 CPU(s):                    14-27,42-55
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          KVM: Mitigation: VMX disabled
Vulnerability L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 642, in _build_model
[rank0]:     model, cache_engine, cache_config = _tp_build_model(
[rank0]:                                         ^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 343, in _tp_build_model
[rank0]:     raise e
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/engine/model_agent.py", line 320, in _tp_build_model
[rank0]:     patched_model = build_patched_model(model_config, device=device_map)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdenv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 204, in build_patched_model
[rank0]:     return build_model_from_hf_config(model_config, dtype=dtype, device=device)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 194, in build_model_from_hf_config
[rank0]:     model_cls = _get_model_class(model_config, module_map)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/lmdeploy-0.7.0.post3-qwen2.5-vl/lmdeploy/pytorch/models/patch.py", line 184, in _get_model_class
[rank0]:     raise RuntimeError(f'Can not found rewrite for architectures: {architectures}')
[rank0]: RuntimeError: Can not found rewrite for architectures: ['Qwen2_5_VLForConditionalGeneration']
[rank0]:[W225 09:24:17.484472721 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

I know there were some changes in the model and architectures... my latest configuration works with a vLLM release and matches the latest changes in Qwen2.5-VL.

piotr-sikora-v avatar Feb 25 '25 09:02 piotr-sikora-v

@piotr-sikora-v It seems you are using the PyTorch backend, which is not supported yet. You can try passing a TurbomindEngineConfig as the backend config. If you still encounter problems, please provide reproducible code.
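
For example, a minimal sketch of explicitly selecting the turbomind backend through the Python API; the model id and image URL are placeholders, and tp/session_len should be adjusted to your setup.

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Explicitly request the turbomind engine instead of relying on auto-detection.
pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct',
                backend_config=TurbomindEngineConfig(tp=1, session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
print(pipe(('describe this image', image)).text)
```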

irexyc avatar Feb 25 '25 09:02 irexyc

@piotr-sikora-v It seems you are using the PyTorch backend, which is not supported yet. You can try passing a TurbomindEngineConfig as the backend config. If you still encounter problems, please provide reproducible code.

I'm running it from the CLI with the backend set to turbomind, but I don't know why it falls back to PyTorch. I built it from your release and then ran pip install -e .

No errors

...
Requirement already satisfied: six>=1.5 in /root/lmdenv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets->outlines<0.1.0->lmdeploy==0.7.0.post3) (1.17.0)
Building wheels for collected packages: lmdeploy
  Building editable for lmdeploy (pyproject.toml) ... done
  Created wheel for lmdeploy: filename=lmdeploy-0.7.0.post3-0.editable-py3-none-any.whl size=12668 sha256=0677ea3d15cc5ff3cb7df91227fcaac23e2eeca8ea5712616584327c4bc79bf2
  Stored in directory: /tmp/pip-ephem-wheel-cache-b_p080vj/wheels/34/17/c2/7b396938fa7c074d4d5a12e9b171b3e1e3d09d1f65f742809e
Successfully built lmdeploy
Installing collected packages: lmdeploy
  Attempting uninstall: lmdeploy
    Found existing installation: lmdeploy 0.7.0.post3
    Uninstalling lmdeploy-0.7.0.post3:
      Successfully uninstalled lmdeploy-0.7.0.post3
Successfully installed lmdeploy-0.7.0.post3

here is my full command:

lmdeploy serve api_server --dtype float16 --cache-max-entry-count 0.1 --max-concurrent-requests 16 --log-level INFO --enable-prefix-caching --model-name /root/model_qwen3_cmon3 --tp 4 --server-port 8000 --backend turbomind /root/model_qwen3_cmon3-vllm-latest/
2025-02-25 09:58:07,712 - lmdeploy - WARNING - archs.py:57 - Fallback to pytorch engine because turbomind engine is not installed correctly. If you insist to use turbomind engine, you may need to reinstall lmdeploy from pypi or build from source and try again.
2025-02-25 09:58:07,712 - lmdeploy - WARNING - archs.py:62 - Try to run with pytorch engine because `/root/model_qwen3_cmon3-vllm-latest/` is not explicitly supported by lmdeploy.
2025-02-25 09:58:10,215 - lmdeploy - INFO - builder.py:63 - matching vision model: Qwen2VLModel
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
2025-02-25 09:58:11,862 - lmdeploy - INFO - async_engine.py:260 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='float16', tp=4, session_len=None, max_batch_size=128, cache_max_entry_count=0.1, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=True, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0)
2025-02-25 09:58:11,863 - lmdeploy - INFO - async_engine.py:261 - input chat_template_config=None
2025-02-25 09:58:11,870 - lmdeploy - INFO - async_engine.py:270 - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-02-25 09:58:13,923 - lmdeploy - WARNING - transformers.py:22 - LMDeploy requires transformers version: [4.33.0 ~ 4.46.1], but found version: 4.50.0.dev0

piotr-sikora-v avatar Feb 25 '25 10:02 piotr-sikora-v

I built it from your release and then ran pip install -e .

pip install -e . won't build the turbomind backend. You can install the wheel package from here

irexyc avatar Feb 25 '25 10:02 irexyc

I built it from your release and then ran pip install -e .

pip install -e . won't build the turbomind backend. You can install the wheel package from here

Great! It works! I think it's 20% faster than vLLM, but I need to do some benchmarks.

I don't know why yet, but sometimes it freezes while generating... it might be because of my configuration.

piotr-sikora-v avatar Feb 25 '25 11:02 piotr-sikora-v

After one hour of running I got a crash

2025-02-25 13:48:25,776 - lmdeploy - INFO - async_engine.py:675 - session=120, history_tokens=0, input_tokens=2028, max_new_tokens=4096, seq_start=True, seq_end=True, step=0, prep=True
2025-02-25 13:48:25,776 - lmdeploy - INFO - turbomind.py:560 - [async_stream_infer] session 120 start
[TM][INFO] [ProcessInferRequests] Request for 120 received.
[TM][INFO] [Forward] [0, 1), dc=0, pf=1, sum_q=2028, sum_k=2028, max_q=2028, max_k=2028
2025-02-25 13:48:25,848 - lmdeploy - ERROR - async_engine.py:592 - [safe_run] exception caught: OverflowError out of range integral type conversion attempted
2025-02-25 13:48:25,849 - lmdeploy - INFO - turbomind.py:622 - [async_stream_infer] GeneratorExit
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/models/llama/unified_decoder.cc:265 

In the logs I saw that GPU memory kept increasing the whole time.

command:

lmdeploy serve api_server  --dtype float16 --cache-max-entry-count 0.1 --max-concurrent-requests 16  --log-level INFO    --enable-prefix-caching  --tp 4 --server-port 8000 --backend turbomind Qwen/Qwen2.5-VL-3B-Instruct

System: 4x V100 SXM2

BTW, I don't have metrics in lmdeploy, so I can only compare the average time of my jobs: on vLLM it was 10.52 s, on lmdeploy it was 9.03 s. Any hint on how to get better numbers with this hardware and model?
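
For reference, a rough sketch of how such an average-latency comparison can be scripted against the OpenAI-compatible endpoint started by lmdeploy serve api_server; the endpoint, model name and prompt below are placeholders.

```python
import statistics
import time

from openai import OpenAI

# api_server exposes an OpenAI-compatible API on the chosen port.
client = OpenAI(base_url='http://localhost:8000/v1', api_key='none')

latencies = []
for _ in range(10):
    start = time.perf_counter()
    client.chat.completions.create(
        model='Qwen/Qwen2.5-VL-3B-Instruct',
        messages=[{'role': 'user', 'content': 'Describe a tiger in one sentence.'}],
        max_tokens=128,
    )
    latencies.append(time.perf_counter() - start)

print(f'avg {statistics.mean(latencies):.2f}s, median {statistics.median(latencies):.2f}s')
```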

piotr-sikora-v avatar Feb 25 '25 13:02 piotr-sikora-v

@piotr-sikora-v

Thanks for your feedback, I will check this.

For better performance, you can quantize the model according to this doc
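
For anyone following along, a hedged sketch of what running a W4A16 (AWQ) checkpoint on the turbomind backend could look like once the model has been quantized with the lite tooling described in that doc; the local path below is a placeholder.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    './qwen2-vl-7b-instruct-4bit',   # placeholder: output directory of the AWQ quantization
    backend_config=TurbomindEngineConfig(
        model_format='awq',          # tell turbomind the weights are AWQ (W4A16) quantized
        quant_policy=8,              # optional: 8-bit kv cache for extra memory savings
        cache_max_entry_count=0.5,
    ),
)
print(pipe('Hello').text)
```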

irexyc avatar Feb 26 '25 02:02 irexyc

Any updates on this?

santapo avatar Apr 21 '25 16:04 santapo

@irexyc I ran into this problem as well.

2025-02-25 13:48:25,848 - lmdeploy - ERROR - async_engine.py:592 - [safe_run] exception caught: OverflowError out of range integral type conversion attempted
2025-02-25 13:48:25,849 - lmdeploy - INFO - turbomind.py:622 - [async_stream_infer] GeneratorExit
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/models/llama/unified_decoder.cc:265

From the Python logs, it looks like a token outside the vocabulary range is produced during decoding, i.e. a token id that does not exist in the vocabulary.

I tried to catch this bug with a try block in the Python code, but the process still crashes with a C++ core error. How can I catch this problem?

zzf-damon avatar May 08 '25 03:05 zzf-damon

The latest main branch supports Qwen2-VL and Qwen2.5-VL with the turbomind engine. v0.10.0 will be released soon.

lvhan028 avatar Sep 08 '25 04:09 lvhan028