lmdeploy [Bug] Error when running InternVL2-40B inference with lmdeploy

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Running InternVL2-40B inference with lmdeploy fails with the error below. The full traceback is given in the Error traceback section.

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Reproduction

As in the title.

Environment

As in the title.

Error traceback

ERROR:asyncio:Exception in callback _raise_exception_on_finish(<Future finis...sertions.\n')>) at /root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py:19
handle: <Handle _raise_exception_on_finish(<Future finis...sertions.\n')>) at /root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py:19>
Traceback (most recent call last):
  File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 26, in _raise_exception_on_finish
    raise e
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 22, in _raise_exception_on_finish
    task.result()
  File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 151, in forward
    outputs = self.model.forward(inputs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 172, in forward
    return self._forward_func(images)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 153, in _forward_v1_5
    outputs = self.model.extract_feature(outputs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_internvl_chat.py", line 176, in extract_feature
    vit_embeds = self.vision_model(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_intern_vit.py", line 418, in forward
    encoder_outputs = self.encoder(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_intern_vit.py", line 354, in forward
    layer_outputs = encoder_layer(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 363, in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 174, in send_to_device
    return honor_type(
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 81, in honor_type
    return type(obj)(generator)
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 175, in <genexpr>
    tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
  File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Aug 08 '24 09:08 hitzhu

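Are you using tp (tensor parallelism)?

What happens if you set the environment variable first, with export CUDA_LAUNCH_BLOCKING=1, before launching the program and then run it again?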
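
The suggestion above can be sketched in Python as follows. CUDA_LAUNCH_BLOCKING needs to be in the environment before the process makes its first CUDA call, so this sketch sets it for a child process instead of changing it inside an already-running one. The helper names `blocking_env` and `run_with_blocking` are illustrative assumptions and not part of lmdeploy.

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With this variable set, kernel launches are synchronous, so a CUDA error
    # should surface closer to the call that actually triggered it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking(cmd):
    """Run cmd (a list of arguments) with CUDA_LAUNCH_BLOCKING=1; return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode


# Hypothetical usage (the script name is a placeholder):
# run_with_blocking(["python", "my_inference_script.py"])
```

This is equivalent to running export CUDA_LAUNCH_BLOCKING=1 in the shell before starting the program; it only changes where the error is reported, not whether it occurs.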

Aug 08 '24 09:08 irexyc

Yes, tp=4 on A100s; the model does not fit without it. After setting the variable I still get the same error.

Aug 08 '24 11:08 hitzhu

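When creating the pipeline / server, try setting cache_max_entry_count to 0.1 to reduce KV cache usage. The vision part reuses the upstream code, so it is unlikely to be where the problem is; I suspect this is caused by insufficient GPU memory. How much GPU memory is left after the model has started?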
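
As a rough illustration of the two things asked about here, the sketch below reads the free memory per GPU from nvidia-smi and estimates how much of it a given cache_max_entry_count would reserve. It assumes that cache_max_entry_count is treated as a fraction of the GPU memory that is still free, which may not hold for every lmdeploy version. The helper names are hypothetical, and the lmdeploy call in the trailing comment is not executed.

```python
import subprocess


def parse_free_mib(csv_text):
    """Parse 'index, memory.free' CSV rows into {gpu_index: free_MiB}."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, mib = (field.strip() for field in line.split(","))
        free[int(index)] = int(mib)
    return free


def query_free_mib():
    """Ask nvidia-smi for the free memory of each GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_mib(out)


def kv_cache_budget_mib(free_mib, cache_max_entry_count):
    """Estimate the KV cache size, ASSUMING the setting is a fraction of free memory."""
    return {gpu: round(mib * cache_max_entry_count)
            for gpu, mib in free_mib.items()}


# Assumed way of passing the value to lmdeploy (not run here):
# from lmdeploy import pipeline, TurbomindEngineConfig
# pipe = pipeline(model_path,
#                 backend_config=TurbomindEngineConfig(tp=4, cache_max_entry_count=0.1))
```

Calling `kv_cache_budget_mib(query_free_mib(), 0.1)` once the model is up gives a per-GPU estimate in MiB; comparing it with the value for the default setting shows how much headroom the smaller ratio leaves for the vision model.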

Aug 08 '24 11:08 irexyc

Solved: it fails on 4 A100s with tp=4, but works on 2 A100s with tp=2.

Aug 08 '24 17:08 hitzhu

I don't think this counts as solved; the cause is still unclear.

Aug 09 '24 03:08 irexyc

Could it be that a different tp size leads to a different model split strategy?

Aug 09 '24 03:08 hitzhu

I don't think so. If it is convenient, could you check whether the error also occurs inside this image? https://hub.docker.com/r/openmmlab/lmdeploy/tags

Aug 09 '24 05:08 irexyc

I ran into the same problem on a single machine with one RTX 3090 and one 2080 Ti (22 GB). Environment information:

sys.platform: linux
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:

GCC 9.3
C++ Version: 201703
Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 12.1
NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
CuDNN 8.9.2
Magma 2.6.1
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.5.3+9f3e748
transformers: 4.42.4
gradio: 3.50.2
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Aug 09 '24 10:08 haoduoyu1203

Same problem here. It occasionally hangs at this point:

File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)

Tracing shows that accelerate's send_to_device function never returns.
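
One way to confirm where such a hang sits is to have Python dump the stack of every thread when a call has not returned within a timeout. The sketch below uses the standard-library faulthandler module. The wrapper name `dump_if_stuck` and the timeout are illustrative assumptions, and the dump only shows Python-level frames, not what the GPU is doing.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_if_stuck(timeout_s, file=None):
    """Dump all thread tracebacks to file if the block runs longer than timeout_s."""
    target = sys.stderr if file is None else file
    # Schedule a one-shot dump; exit=False keeps the process alive afterwards.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        yield
    finally:
        # Cancel the pending dump once the block has returned.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the call that appears to hang:
# with dump_if_stuck(60):
#     response = pipe(prompt)
```

If the wrapped call returns in time nothing is printed; otherwise the dump should show which frame each thread is blocked in, for example the tensor.to call quoted above.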

Aug 29 '24 01:08 DefTruth