lmdeploy
Sudden hang after 200+ consecutive requests
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Using Alibaba's official qwen2-72b-instruct-awq model, I send one request every 0.3 s, each about 3570 tokens. The first ~260 requests return quickly, but after 260 or so the service suddenly gets stuck and it takes a very long time before a response comes back. No error appears during that period and GPU utilization shows 99%. What could be causing this?
Reproduction
My launch command is:
lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --model-name [model_name] --cache-max-entry-count 0.8
Here is part of my code:
import json
import os
import time

import pandas as pd
import requests

# headers, dirs, [model_name], and [ip] are defined/substituted elsewhere in the original script
def llm_result(query):
    json_data2 = {
        'model': [model_name],
        'messages': [
            # all content combined is about 3570 tokens
            {
                'role': 'system',
                'content': 'xxx'
            },
            {
                'role': 'user',
                'content': f'''xxx'''
            }
        ],
    }
    response = requests.post('http://[ip]:6005/v1/chat/completions', headers=headers, json=json_data2)
    text = json.loads(response.text)
    message = text["choices"][0]["message"]["content"]
    return message

def main():
    file = "abc.xlsx"
    excel_file = os.path.join(dirs, file)
    df = pd.read_excel(excel_file)
    datas = df.values
    for data in datas:
        content = data[5]
        message = llm_result(content)
        time.sleep(0.3)
        print(message)
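(Editor's note: a hedged variant of the request call with an explicit timeout; the 600-second value is illustrative. It makes a server-side hang surface as a client exception instead of blocking indefinitely.)

# illustrative only: same call as above, but failing fast when the server hangs
response = requests.post(
    'http://[ip]:6005/v1/chat/completions',
    headers=headers,
    json=json_data2,
    timeout=600,  # seconds; raises requests.exceptions.Timeout instead of waiting forever
)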
Environment
sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 11.8
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.7
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.16.0
LMDeploy: 0.5.1+unknown
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
NVIDIA Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 0-35,72-107 0 N/A
GPU1 NV12 X 0-35,72-107 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Error traceback
No response
I ran into the same problem, also with qwen2-72b. My requests are about 1/3 of your token count and I throttled them, but the hang likewise started after about 3x your request count (around 900 requests). Have you solved it?
And I found that the same problem also occurs when I run the 7B model.
Same problem here.
Same here, 2x A100, 26B model.
@zhulinJulia24 could you help try to reproduce this issue?
Same problem here.
@lvhan028 This feels like a serious bug. With VL models I often hit this intermittent hang: no error is reported, it just hangs and never returns. It looks like the collective communication of accelerate and lmdeploy deadlocks, because requests are issued asynchronously and the ViT inference and LLM inference actually overlap in a pipeline. Trace log:
--- Stack for thread 23201439544896 ---
File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
work_item.run()
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/engine.py", line 108, in forward
outputs = self.model.forward(inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/model/internvl.py", line 174, in forward
return self._forward_func(images)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/model/internvl.py", line 155, in _forward_v1_5
outputs = self.model.extract_feature(outputs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_internvl_chat.py", line 216, in extract_feature
vit_embeds = self.vision_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 418, in forward
encoder_outputs = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 354, in forward
layer_outputs = encoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 164, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 363, in pre_forward
return send_to_device(args, self.execution_device), send_to_device(
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 174, in send_to_device
return honor_type(
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 81, in honor_type
return type(obj)(generator)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 175, in <genexpr>
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
Has a solution been found? It happens quite often.
@irexyc may follow up on this issue
@lai-serena @DefTruth
Could you reduce the KV cache allocation (--cache-max-entry-count 0.4 or lower) to leave a larger GPU memory buffer (e.g. 5 GB), and then observe again?
Regarding the 99%-utilization-with-no-response issue, it is best to check on the server side; start the server with logging enabled (--log-level INFO). We have previously seen individual requests that could not stop generating, which made the generation phase take a very long time.
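(Editor's note: the launch command from this report adjusted with both suggestions might look as follows; 0.4 is a starting point to tune, not a verified fix.)

lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --model-name [model_name] --cache-max-entry-count 0.4 --log-level INFO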
I cannot reproduce it on A100 80G
My script is:
import requests
import json
import time

def llm_result(query):
    json_data2 = {
        'model': 'qwen2',
        'messages': [
            {
                'role': 'system',
                'content': ''
            },
            {
                'role': 'user',
                'content': query
            }
        ],
    }
    headers = {'Content-Type': 'application/json'}
    response = requests.post('http://0.0.0.0:6005/v1/chat/completions', headers=headers, json=json_data2)
    text = json.loads(response.text)
    message = text["choices"][0]["message"]["content"]
    return message

datas = ["你好,你是谁"*1000]*600
for data in datas:
    content = data
    # print(content)
    start_time = time.time()
    message = llm_result(content)
    end_time = time.time()
    task_duration_seconds = round(end_time - start_time, 2)
    time.sleep(0.3)
    print(task_duration_seconds)
The input content is about 4000 tokens, and each response takes roughly 1-3 s. Can you try adding --cache-max-entry-count 0.4 when starting the api server?
I hit the same problem with qwen2.5-72B-Instruct: after two or three thousand requests it hangs.
Same problem with InternVL2-8B.
@ChenZiHong-Gavin @nzomi
Could you try changing the raise e at https://github.com/InternLM/lmdeploy/blob/v0.6.4/lmdeploy/vl/engine.py#L26-L27 to sys.exit(1) to see whether the problem is in the vision part or the llm part? Alternatively, run the latest version (0.6.4) with the pytorch backend and check whether it still hangs.
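(Editor's note: a hedged sketch of what that change might look like; the surrounding callback is paraphrased, not the exact code at lmdeploy/vl/engine.py#L26-L27, and the function name _on_vision_task_done is made up here.)

import sys

def _on_vision_task_done(task):
    # paraphrased error handler; the real code around engine.py#L26-L27 may differ
    try:
        task.result()
    except Exception:
        # original behaviour: raise e (the exception may only surface much later, if at all)
        sys.exit(1)  # suggested change: exit immediately so a vision-side failure is obvious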
@irexyc From my testing so far, it seems that high concurrency makes the inference latency of later requests grow gradually until it crosses a threshold and everything appears stuck. In particular, when some images produce bad outputs (e.g. endless repetition until max_token is reached), the chance of a subsequent hang is higher. Is there a way to detect this kind of abnormal output early?
@nzomi
Feature extraction is fairly time-consuming. If too many requests arrive at once, later requests have to wait until the earlier ones have been processed, so requests near the end of the queue take long to complete.
As for "stuck", the symptom needs to be clarified. Enable server-side logging at INFO level: if you send a request and the server log stops producing output, that counts as stuck. If the server log keeps producing output while the client sees repeated output, that does not count as stuck.
For the former, we are currently trying to reproduce it; for the latter, setting a smaller max_token can mitigate it.
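(Editor's note: for example, the request-side cap could be set as below; max_tokens is the standard OpenAI-compatible field, and 512 is an illustrative value, not a recommendation from the maintainers.)

json_data2 = {
    'model': 'qwen2',
    'messages': [
        {'role': 'user', 'content': 'xxx'},
    ],
    'max_tokens': 512,  # illustrative cap: a runaway, repeating response stops here instead of at the server default
}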
@irexyc The client does not receive a result for a long time, but when I send a new request I can see from the server log that it is received, so it is not stuck, just processing slowly, right? In that case, when a later request gets its response seems unpredictable and depends on how many earlier requests have piled up?
@nzomi
Set the log level to INFO, then check whether log lines starting with [TM][INFO] keep being printed.
@irexyc I put a gRPC relay in front of the deployment, so I could not observe the [TM][INFO] lines. I wrote my own async handling on the relay server; limiting the semaphore count mitigates the timeouts, but the average per-request inference time increases a lot (a minimal sketch of this approach is given below).
If many requests come in, are they queued in arrival order? Does a later request always have to wait for the earlier ones to finish before it starts?
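(Editor's note: a minimal client-side sketch of the semaphore approach mentioned above, assuming the aiohttp package, the placeholder endpoint from this thread, and an arbitrary limit of 4 in-flight requests; this is not the commenter's actual relay code.)

import asyncio
import aiohttp

URL = 'http://[ip]:6005/v1/chat/completions'  # placeholder endpoint, as in the report above

async def ask(session, semaphore, content):
    payload = {'model': 'qwen2', 'messages': [{'role': 'user', 'content': content}]}
    async with semaphore:                      # wait here until one of the 4 slots is free
        async with session.post(URL, json=payload) as resp:
            data = await resp.json()
            return data['choices'][0]['message']['content']

async def main(contents):
    semaphore = asyncio.Semaphore(4)           # illustrative cap on concurrent requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(ask(session, semaphore, c) for c in contents))

# results = asyncio.run(main(['hello'] * 100))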
Regarding the hang: the InternVL team has reproduced it and has a script that reproduces it reliably; a fix is in progress.
Hi! Could you share roughly what is causing it? We ran into it recently as well and cannot reproduce it reliably; could we borrow the team's reproduction script?
Sorry, I forgot to post an update. The vision part is inferred with transformers while the llm part is inferred with turbomind, and we suspect the hang is caused by their communication. https://github.com/InternLM/lmdeploy/pull/3126 provides a fix, but the performance loss is significant, so we did not merge it. We recommend switching to the pytorch engine.
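(Editor's note: switching to the pytorch engine is a launch-time option; a hedged example based on the command earlier in this thread. Depending on the model, extra flags such as the quantization format may also be needed.)

lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --backend pytorch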
Could the vision part use gloo for CPU communication instead? That way it would not conflict with the nccl backend and would not hang.
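(Editor's note: a minimal sketch of what a separate CPU communication group looks like with torch.distributed; this is only the general API, not lmdeploy's actual implementation.)

import torch
import torch.distributed as dist

# assumes the usual env:// initialization (RANK, WORLD_SIZE, MASTER_ADDR/PORT set by the launcher)
dist.init_process_group(backend='nccl')     # GPU collectives, e.g. for the tensor-parallel LLM part
cpu_group = dist.new_group(backend='gloo')  # separate gloo group for CPU-side communication

t = torch.ones(1)                           # CPU tensor
dist.all_reduce(t, group=cpu_group)         # goes over gloo, independent of the nccl collectives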
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
