lmdeploy
Sudden hang after 200+ consecutive requests
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Using Alibaba's official qwen2-72b-instruct-awq model, I send one request every 0.3 s, each about 3570 tokens. The first ~260 requests return quickly, but after 260 or so the service suddenly gets stuck and it takes a very long time before a response comes back. No error appears during that period and GPU utilization shows 99%. What could be causing this?
Reproduction
My launch command is:
lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --model-name [model_name] --cache-max-entry-count 0.8
Here is part of my code:
import json
import os
import time

import pandas as pd
import requests

# headers, dirs, [model_name], and [ip] are defined/substituted elsewhere in the original script
def llm_result(query):
    json_data2 = {
        'model': [model_name],
        'messages': [
            # all content combined is about 3570 tokens
            {
                'role': 'system',
                'content': 'xxx'
            },
            {
                'role': 'user',
                'content': f'''xxx'''
            }
        ],
    }
    response = requests.post('http://[ip]:6005/v1/chat/completions', headers=headers, json=json_data2)
    text = json.loads(response.text)
    message = text["choices"][0]["message"]["content"]
    return message

def main():
    file = "abc.xlsx"
    excel_file = os.path.join(dirs, file)
    df = pd.read_excel(excel_file)
    datas = df.values
    for data in datas:
        content = data[5]
        message = llm_result(content)
        time.sleep(0.3)
        print(message)
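(Editor's note: a hedged variant of the request call with an explicit timeout; the 600-second value is illustrative. It makes a server-side hang surface as a client exception instead of blocking indefinitely.)

# illustrative only: same call as above, but failing fast when the server hangs
response = requests.post(
    'http://[ip]:6005/v1/chat/completions',
    headers=headers,
    json=json_data2,
    timeout=600,  # seconds; raises requests.exceptions.Timeout instead of waiting forever
)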
Environment
sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 11.8
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.7
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.16.0
LMDeploy: 0.5.1+unknown
transformers: 4.42.4
gradio: 4.38.1
fastapi: 0.111.1
pydantic: 2.8.2
triton: 2.1.0
NVIDIA Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 0-35,72-107 0 N/A
GPU1 NV12 X 0-35,72-107 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Error traceback
No response
I ran into the same problem, also with qwen2-72b. My requests are about 1/3 of your token count and I throttled them, but the hang likewise started after about 3x your request count (around 900 requests). Have you solved it?
And I found that the same problem also occurs when I run the 7B model.
Same problem here.
Same here, 2x A100, 26B model.
@zhulinJulia24 could you help try to reproduce this issue?
Same problem here.
@lvhan028 This feels like a serious bug. With VL models I often hit this intermittent hang: no error is reported, it just hangs and never returns. It looks like the collective communication of accelerate and lmdeploy deadlocks, because requests are issued asynchronously and the ViT inference and LLM inference actually overlap in a pipeline. Trace log:
--- Stack for thread 23201439544896 ---
File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
work_item.run()
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/engine.py", line 108, in forward
outputs = self.model.forward(inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/model/internvl.py", line 174, in forward
return self._forward_func(images)
File "/usr/local/lib/python3.10/dist-packages/lmdeploy/vl/model/internvl.py", line 155, in _forward_v1_5
outputs = self.model.extract_feature(outputs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_internvl_chat.py", line 216, in extract_feature
vit_embeds = self.vision_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 418, in forward
encoder_outputs = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 354, in forward
layer_outputs = encoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 164, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 363, in pre_forward
return send_to_device(args, self.execution_device), send_to_device(
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 174, in send_to_device
return honor_type(
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 81, in honor_type
return type(obj)(generator)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 175, in <genexpr>
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 155, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
Has a solution been found? It happens quite often.
@irexyc may follow up on this issue
@lai-serena @DefTruth
Could you reduce the KV cache allocation (--cache-max-entry-count 0.4 or lower) to leave a larger GPU memory buffer (e.g. 5 GB), and then observe again?
Regarding the 99%-utilization-with-no-response issue, it is best to check on the server side; start the server with logging enabled (--log-level INFO). We have previously seen individual requests that could not stop generating, which made the generation phase take a very long time.
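(Editor's note: the launch command from this report adjusted with both suggestions might look as follows; 0.4 is a starting point to tune, not a verified fix.)

lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --model-name [model_name] --cache-max-entry-count 0.4 --log-level INFO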
I cannot reproduce it on A100 80G
My script is:
import requests
import json
import time

def llm_result(query):
    json_data2 = {
        'model': 'qwen2',
        'messages': [
            {
                'role': 'system',
                'content': ''
            },
            {
                'role': 'user',
                'content': query
            }
        ],
    }
    headers = {'Content-Type': 'application/json'}
    response = requests.post('http://0.0.0.0:6005/v1/chat/completions', headers=headers, json=json_data2)
    text = json.loads(response.text)
    message = text["choices"][0]["message"]["content"]
    return message

datas = ["你好,你是谁"*1000]*600
for data in datas:
    content = data
    # print(content)
    start_time = time.time()
    message = llm_result(content)
    end_time = time.time()
    task_duration_seconds = round(end_time - start_time, 2)
    time.sleep(0.3)
    print(task_duration_seconds)
The input content is about 4000 tokens, and each response takes roughly 1-3 s. Can you try adding --cache-max-entry-count 0.4 when starting the api server?
I hit the same problem with qwen2.5-72B-Instruct: after two or three thousand requests it hangs.
Same problem with InternVL2-8B.
@ChenZiHong-Gavin @nzomi
Could you try changing the raise e at https://github.com/InternLM/lmdeploy/blob/v0.6.4/lmdeploy/vl/engine.py#L26-L27 to sys.exit(1) to see whether the problem is in the vision part or the llm part? Alternatively, run the latest version (0.6.4) with the pytorch backend and check whether it still hangs.
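(Editor's note: a hedged sketch of what that change might look like; the surrounding callback is paraphrased, not the exact code at lmdeploy/vl/engine.py#L26-L27, and the function name _on_vision_task_done is made up here.)

import sys

def _on_vision_task_done(task):
    # paraphrased error handler; the real code around engine.py#L26-L27 may differ
    try:
        task.result()
    except Exception:
        # original behaviour: raise e (the exception may only surface much later, if at all)
        sys.exit(1)  # suggested change: exit immediately so a vision-side failure is obvious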
@irexyc From my testing so far, it seems that high concurrency makes the inference latency of later requests grow gradually until it crosses a threshold and everything appears stuck. In particular, when some images produce bad outputs (e.g. endless repetition until max_token is reached), the chance of a subsequent hang is higher. Is there a way to detect this kind of abnormal output early?
@nzomi
Feature extraction is fairly time-consuming. If too many requests arrive at once, later requests have to wait until the earlier ones have been processed, so requests near the end of the queue take long to complete.
As for "stuck", the symptom needs to be clarified. Enable server-side logging at INFO level: if you send a request and the server log stops producing output, that counts as stuck. If the server log keeps producing output while the client sees repeated output, that does not count as stuck.
For the former, we are currently trying to reproduce it; for the latter, setting a smaller max_token can mitigate it.
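(Editor's note: for example, the request-side cap could be set as below; max_tokens is the standard OpenAI-compatible field, and 512 is an illustrative value, not a recommendation from the maintainers.)

json_data2 = {
    'model': 'qwen2',
    'messages': [
        {'role': 'user', 'content': 'xxx'},
    ],
    'max_tokens': 512,  # illustrative cap: a runaway, repeating response stops here instead of at the server default
}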
@irexyc The client does not receive a result for a long time, but when I send a new request I can see from the server log that it is received, so it is not stuck, just processing slowly, right? In that case, when a later request gets its response seems unpredictable and depends on how many earlier requests have piled up?
@nzomi
Set the log level to INFO, then check whether log lines starting with [TM][INFO] keep being printed.
@irexyc I put a gRPC relay in front of the deployment, so I could not observe the [TM][INFO] lines. I wrote my own async handling on the relay server; limiting the semaphore count mitigates the timeouts, but the average per-request inference time increases a lot (a minimal sketch of this approach is given below).
If many requests come in, are they queued in arrival order? Does a later request always have to wait for the earlier ones to finish before it starts?
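(Editor's note: a minimal client-side sketch of the semaphore approach mentioned above, assuming the aiohttp package, the placeholder endpoint from this thread, and an arbitrary limit of 4 in-flight requests; this is not the commenter's actual relay code.)

import asyncio
import aiohttp

URL = 'http://[ip]:6005/v1/chat/completions'  # placeholder endpoint, as in the report above

async def ask(session, semaphore, content):
    payload = {'model': 'qwen2', 'messages': [{'role': 'user', 'content': content}]}
    async with semaphore:                      # wait here until one of the 4 slots is free
        async with session.post(URL, json=payload) as resp:
            data = await resp.json()
            return data['choices'][0]['message']['content']

async def main(contents):
    semaphore = asyncio.Semaphore(4)           # illustrative cap on concurrent requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(ask(session, semaphore, c) for c in contents))

# results = asyncio.run(main(['hello'] * 100))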
Regarding the hang: the InternVL team has reproduced it and has a script that reproduces it reliably; a fix is in progress.
Hi! Could you share roughly what is causing it? We ran into it recently as well and cannot reproduce it reliably; could we borrow the team's reproduction script?
Sorry, I forgot to post an update. The vision part is inferred with transformers while the llm part is inferred with turbomind, and we suspect the hang is caused by their communication. https://github.com/InternLM/lmdeploy/pull/3126 provides a fix, but the performance loss is significant, so we did not merge it. We recommend switching to the pytorch engine.
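(Editor's note: switching to the pytorch engine is a launch-time option; a hedged example based on the command earlier in this thread. Depending on the model, extra flags such as the quantization format may also be needed.)

lmdeploy serve api_server /workspace/qwen/Qwen2-72B-Instruct-AWQ --server-port 6005 --tp 2 --backend pytorch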
Could the vision part use gloo for CPU communication instead? That way it would not conflict with the nccl backend and would not hang.
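(Editor's note: a minimal sketch of what a separate CPU communication group looks like with torch.distributed; this is only the general API, not lmdeploy's actual implementation.)

import torch
import torch.distributed as dist

# assumes the usual env:// initialization (RANK, WORLD_SIZE, MASTER_ADDR/PORT set by the launcher)
dist.init_process_group(backend='nccl')     # GPU collectives, e.g. for the tensor-parallel LLM part
cpu_group = dist.new_group(backend='gloo')  # separate gloo group for CPU-side communication

t = torch.ones(1)                           # CPU tensor
dist.all_reduce(t, group=cpu_group)         # goes over gloo, independent of the nccl collectives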
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
