[Bug] Error when running inference on InternVL2-40B with lmdeploy
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Inference fails with "RuntimeError: CUDA error: an illegal memory access was encountered". The error is raised from accelerate's send_to_device (tensor.to(device, non_blocking=non_blocking)) while lmdeploy's vision engine runs the InternVL2-40B vision encoder. The full traceback is in the "Error traceback" section.
Reproduction
As in the title.
Environment
As in the title.
Error traceback
ERROR:asyncio:Exception in callback _raise_exception_on_finish(<Future finis...sertions.\n')>) at /root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py:19
handle: <Handle _raise_exception_on_finish(<Future finis...sertions.\n')>) at /root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py:19>
Traceback (most recent call last):
File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 26, in _raise_exception_on_finish
raise e
File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 22, in _raise_exception_on_finish
task.result()
File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/opt/conda/envs/python3.10.13/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 151, in forward
outputs = self.model.forward(inputs)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 172, in forward
return self._forward_func(images)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 153, in _forward_v1_5
outputs = self.model.extract_feature(outputs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_internvl_chat.py", line 176, in extract_feature
vit_embeds = self.vision_model(
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_intern_vit.py", line 418, in forward
encoder_outputs = self.encoder(
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL2-40B/modeling_intern_vit.py", line 354, in forward
layer_outputs = encoder_layer(
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 363, in pre_forward
return send_to_device(args, self.execution_device), send_to_device(
File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 174, in send_to_device
return honor_type(
File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 81, in honor_type
return type(obj)(generator)
File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 175, in <genexpr>
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Are you using tp?
Before launching the program, set the environment variable with export CUDA_LAUNCH_BLOCKING=1. What is the result when you run it again?
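For a Python launcher script, the same setting can be applied in code. This is a minimal sketch, not taken from the thread: the helper name `enable_blocking_launches` is hypothetical, and it assumes the variable is set before torch or lmdeploy is imported, so that it is already in place when CUDA initializes.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 so kernel launches run synchronously
    and the Python traceback points at the call that actually failed."""
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this at the very top of the launcher script.
enable_blocking_launches()

# Import torch / lmdeploy only after this point (assumption: nothing has
# initialized CUDA in this process yet).
```

Synchronous launches slow inference down noticeably, so this is only meant for reproducing the error with a more accurate stack trace.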
Yes, tp=4 on A100s; the model does not fit without it. After setting the variable, I still get the same error.
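When creating the pipeline / server, try setting cache_max_entry_count to 0.1 to reduce KV cache usage. The vision part reuses the upstream code, so it seems unlikely to be the source of the problem; I suspect insufficient GPU memory instead. How much GPU memory is left after the model starts?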
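A sketch of how that setting might be passed when creating a pipeline, assuming the TurboMind backend and lmdeploy's `pipeline` / `TurbomindEngineConfig` API. The helper names `engine_kwargs` and `create_pipeline` are hypothetical, and the model path is whatever was used in the failing run.

```python
def engine_kwargs(tp, cache_max_entry_count=0.1):
    """Keyword arguments for the engine config: tensor-parallel degree
    and the KV-cache memory ratio suggested above."""
    return {"tp": tp, "cache_max_entry_count": cache_max_entry_count}


def create_pipeline(model_path, tp):
    """Build an lmdeploy pipeline with a reduced KV-cache budget."""
    # Imported lazily so engine_kwargs() can be used without lmdeploy installed.
    from lmdeploy import pipeline, TurbomindEngineConfig

    config = TurbomindEngineConfig(**engine_kwargs(tp))
    return pipeline(model_path, backend_config=config)
```

With the setup from this issue, the call would look like `create_pipeline("<path to InternVL2-40B>", tp=4)`. A smaller KV cache leaves more free memory on each GPU for the vision encoder, which is what the suggestion is trying to test.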
Solved: it fails on 4 A100s with tp=4, but works on 2 A100s with tp=2.
I don't think that counts as solved, since the cause is still unclear.
Could it be that a different tp value leads to a different model split strategy?
I don't think so. If convenient, could you check whether the error also occurs inside this image? https://hub.docker.com/r/openmmlab/lmdeploy/tags
I ran into the same problem on a single machine with one 3090 and one 2080 Ti (22 GB). Environment info below:
sys.platform: linux
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.66
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.17.2+cu121
LMDeploy: 0.5.3+9f3e748
transformers: 4.42.4
gradio: 3.50.2
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Same problem here; it intermittently hangs at this point:
File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
I traced it to accelerate's send_to_device function, which never returns.
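Since both the crash and the hang end in `tensor.to(device, non_blocking=non_blocking)`, one way to narrow this down might be to test plain cross-GPU copies outside of lmdeploy and accelerate. The following is a sketch and not something proposed in the thread: the function names are hypothetical, and it assumes PyTorch with at least two visible CUDA devices.

```python
from itertools import permutations


def device_pairs(n_devices):
    """All ordered (src, dst) pairs of distinct device indices."""
    return list(permutations(range(n_devices), 2))


def check_cross_device_copies(numel=1 << 20):
    """Copy a tensor between every pair of GPUs and verify the data.

    A hang or CUDA error here would point at the driver or the
    interconnect rather than at lmdeploy or the model code.
    """
    import torch  # lazy import: only needed when the check actually runs

    results = {}
    for src, dst in device_pairs(torch.cuda.device_count()):
        x = torch.arange(numel, dtype=torch.float32, device=f"cuda:{src}")
        y = x.to(f"cuda:{dst}", non_blocking=True)
        # Synchronize both devices so asynchronous errors surface here.
        torch.cuda.synchronize(src)
        torch.cuda.synchronize(dst)
        results[(src, dst)] = {
            "peer_access": torch.cuda.can_device_access_peer(src, dst),
            "data_ok": torch.equal(x.cpu(), y.cpu()),
        }
    return results
```

Calling `check_cross_device_copies()` on the affected machine returns one entry per GPU pair. If every pair reports `data_ok` as true, the copy path itself is probably fine and the illegal memory access is more likely reported late from an earlier kernel.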