lmdeploy Balance vision model weights on multi gpus

TODO

[x] fix hang when using nccl (turbomind).
[x] docs & cli

https://github.com/InternLM/lmdeploy/issues/1563

May 14 '24 07:05 irexyc

@lzhangzz waiting for your review...

May 21 '24 09:05 buaadf

Really looking forward to this feature.

May 21 '24 13:05 rTrQqgH74lc2PT5k

So with the current tp, is every GPU visible through CUDA_VISIBLE_DEVICES used? That is, even with tp==2, all four GPUs are used if all four are visible?

Yes.

May 22 '24 08:05 irexyc

runtime.txt should state the minimum required version of accelerate.

May 23 '24 03:05 lvhan028

Hi, I get the following error when running the Python file. What is the problem?

  File "D:\新建文件夹\InternDog-master\app_cli.py", line 3, in <module>
    from agent.model import chat
  File "D:\新建文件夹\InternDog-master\agent\model.py", line 2, in <module>
    from lmdeploy import turbomind as tm
  File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\__init__.py", line 24, in <module>
    from .turbomind import TurboMind  # noqa: E402
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\turbomind.py", line 26, in <module>
    from .deploy.converter import (get_model_format, supported_formats,
  File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\deploy\converter.py", line 16, in <module>
    from .target_model.base import OUTPUT_MODELS, TurbomindModelConfig
  File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\deploy\target_model\__init__.py", line 3, in <module>
    from .w4 import TurbomindW4Model  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\deploy\target_model\w4.py", line 17, in <module>
    import _turbomind as _tm  # noqa: E402
    ^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: DLL load failed while importing _turbomind: The specified module could not be found.

May 23 '24 10:05 covdvoyager

@covdvoyager

See whether this helps: https://github.com/InternLM/lmdeploy/issues/1146#issuecomment-2101845391

May 23 '24 11:05 irexyc

@irexyc Does multi-GPU parallelism require a power-of-2 number of cards? I can't get it to run on 3 A30s.

May 24 '24 01:05 buaadf

@buaadf

The tp in backend_config must be a power of 2.

May 24 '24 01:05 irexyc

How does the tp setting relate to the number of GPUs? How many A30 (24 GB) cards does it take at minimum to get this running?

May 24 '24 01:05 buaadf

@buaadf

The tp used to split the LM must be a power of 2. tp=2 means the LM needs two GPUs, and GPUs 0 and 1 are picked from the visible ones.

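If you are running a VLM with tp set to 2 in backend_config and CUDA_VISIBLE_DEVICES="0,1,2", the vision model is split evenly across all three GPUs and the LM is split evenly across the first two.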
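The placement rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this thread: `plan_devices` is a hypothetical helper, not an lmdeploy function, and the real placement logic lives inside lmdeploy.

```python
from typing import Dict, List


def plan_devices(visible: List[int], tp: int) -> Dict[str, List[int]]:
    """Hypothetical helper mirroring the placement rule described in this thread."""
    # tp must be a positive power of two (exactly one bit set).
    if tp <= 0 or tp & (tp - 1) != 0:
        raise ValueError(f"tp must be a power of 2, got {tp}")
    if tp > len(visible):
        raise ValueError(f"tp={tp} needs {tp} GPUs, only {len(visible)} visible")
    return {
        "lm": visible[:tp],       # LM shards go on the first tp visible GPUs
        "vision": list(visible),  # vision weights are spread over all visible GPUs
    }


# Three visible GPUs with tp=2, as in the example above.
plan = plan_devices([0, 1, 2], tp=2)
print(plan)
```

With three A30s, tp=3 is rejected by the power-of-2 check, so tp=2 is the largest usable value while the vision model can still use the third card.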

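Whether it runs depends on which model you use. Counting weights alone (unquantized), a 7B model needs roughly 14 GB of GPU memory and a 20B model roughly 40 GB. Besides the weights, the kv cache also takes GPU memory, which affects session_len and the batch size. You can control its size with cache_max_entry_count.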
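The figures above follow from 2 bytes per parameter for fp16 weights. A rough back-of-the-envelope sketch is below. Both function names are hypothetical, the vision model's own weights are ignored, and treating cache_max_entry_count as a fraction of the memory left after loading weights is an assumption here; check the documentation of your lmdeploy version for the exact meaning.

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); fp16 uses 2 bytes per parameter."""
    return num_params_billion * bytes_per_param


def kv_cache_budget_gb(gpu_mem_gb: float, weights_per_gpu_gb: float, cache_ratio: float) -> float:
    """Assumed model: the kv cache takes `cache_ratio` of what is left after the weights."""
    free_gb = max(gpu_mem_gb - weights_per_gpu_gb, 0.0)
    return free_gb * cache_ratio


for size in (7, 20):
    print(f"{size}B fp16 weights: ~{weight_memory_gb(size):.0f} GB")

# Example: a 7B LM split with tp=2 on 24 GB cards, cache ratio 0.5 (illustrative values).
per_gpu = weight_memory_gb(7) / 2
print(f"weights per GPU: ~{per_gpu:.1f} GB, "
      f"kv cache budget per GPU: ~{kv_cache_budget_gb(24, per_gpu, 0.5):.1f} GB")
```

Lowering the cache ratio leaves more headroom for the vision model and activations, at the cost of a shorter session_len or a smaller batch.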

May 24 '24 01:05 irexyc

Even with tp=2, two 4090s still cannot run the int8 version of InternVL (25 GB of weight files); GPU memory overflows. Any advice would be appreciated.

(internvl) yushen@YuShen-Work:~/ai/InternVL$ python gradio_InternVL.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Exception in thread Thread-3 (_create_weight_func):
Traceback (most recent call last):
  File "/home/yushen/micromamba/envs/internvl/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/yushen/micromamba/envs/internvl/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yushen/micromamba/envs/internvl/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 199, in _create_weight_func
    model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32

Exception in thread Thread-5 (_get_params):
Traceback (most recent call last):
  File "/home/yushen/micromamba/envs/internvl/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/yushen/micromamba/envs/internvl/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yushen/micromamba/envs/internvl/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 215, in _get_params
    out = model_comm.get_params(device_id, rank)
RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:417

May 27 '24 03:05 ysyx2008

@ysyx2008

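We don't support loading int8 models in bnb format. Please quantize the model with our quantization tool, which should be available in 0.4.2.

This is the quantization documentation for LLMs; it applies to VLMs as well. Just replace the model in the demo with a VLM.

An article on VLM quantization will also be published soon, so keep an eye out for it.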

May 27 '24 03:05 irexyc

Thanks a lot. My earlier attempt to quantize the model myself failed with an error, and I just noticed that pip installs version 0.4.1 by default. I'll look into updating to 0.4.2 and try again. Thanks again.

May 27 '24 05:05 ysyx2008

@irexyc error with internlm/internlm-xcomposer2-4khd-7b model

Dummy Resized
Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 11, in <module>
    load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')()
  File "/opt/lmdeploy/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/lmdeploy/lmdeploy/cli/serve.py", line 303, in api_server
    run_api_server(args.model_path,
  File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 1191, in serve
    VariableInterface.async_engine = pipeline_class(
  File "/opt/lmdeploy/lmdeploy/serve/vl_async_engine.py", line 20, in __init__
    self.vl_encoder = ImageEncoder(model_path, vision_config)
  File "/opt/lmdeploy/lmdeploy/vl/engine.py", line 69, in __init__
    self.model = load_vl_model(model_path)
  File "/opt/lmdeploy/lmdeploy/vl/model/builder.py", line 40, in load_vl_model
    return Xcomposer2VisionModel(model_path, with_llm)
  File "/opt/lmdeploy/lmdeploy/vl/model/xcomposer2.py", line 42, in __init__
    self.build_model()
  File "/opt/lmdeploy/lmdeploy/vl/model/xcomposer2.py", line 76, in build_model
    max_memory = get_balanced_memory(
UnboundLocalError: local variable 'get_balanced_memory' referenced before assignment

May 27 '24 06:05 sshuair

@sshuair

This should be fixed in https://github.com/InternLM/lmdeploy/pull/1661
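For context, an UnboundLocalError like the one in the traceback above commonly arises when a name is imported or assigned on only one branch inside a function and then used on another. The sketch below is a generic illustration of that pattern using `math.ceil` as a stand-in; it is not the actual xcomposer2.py code, and it does not claim to describe what the linked PR changes.

```python
def build_model_buggy(multi_gpu: bool) -> str:
    if multi_gpu:
        # The import binds `ceil` as a local name, but only on this branch.
        from math import ceil
    # When multi_gpu is False, `ceil` is still local to the function, just never assigned.
    return str(ceil(1.5))


def build_model_fixed(multi_gpu: bool) -> str:
    # Importing unconditionally guarantees the name is bound before use.
    from math import ceil
    return str(ceil(1.5))


try:
    build_model_buggy(multi_gpu=False)
except UnboundLocalError as exc:
    print(type(exc).__name__, exc)

print(build_model_fixed(multi_gpu=False))
```

The exact wording of the exception message differs between Python versions, but the exception type is the same.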

May 27 '24 07:05 irexyc

Curious to know: does the VLM pipeline support persistent batching? @irexyc

Jun 24 '24 13:06 serser

Is the even split of the vision model done with tp or pp? @irexyc

Jul 23 '24 08:07 Pass-O-Guava
