
[Bug] Do overly fast async calls to lmdeploy serve cause a deadlock? How can they be limited? Or does serving a fine-tuned model with lmdeploy serve break the scheduler?

[Open] Volta-lemon opened this issue 1 year ago • 3 comments

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. The bug has not been fixed in the latest version.
  • [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

The first run was smooth, calling the stock internlm2-chat-20b; it took 300 s. The second run, calling an internlm2-chat-20b fine-tuned with xtuner, had not finished after 30 min. Running on an A100, GPU utilization stayed at 100% and memory usage held at 98%.

Reproduction

Server command: lmdeploy serve api_server /root/ft/internlm2-chat-20b --server-port 23333

Async client code:

import asyncio
import json
import os
import time

import nest_asyncio
from openai import AsyncOpenAI
from tqdm.asyncio import tqdm_asyncio

# Allow async code to be nested inside an already-running event loop
# (e.g. in a Jupyter notebook).
nest_asyncio.apply()

# `content` (the system prompt), `tasks` (the list of user prompts) and
# `file_path` (the output directory) are defined elsewhere in the script.

async def fetch_completion(client, task, model_name, n):
    messages = [
        {"role": "system", "content": content},
        {"role": "user", "content": task}
    ]
    start_time = time.time()
    completion = await client.chat.completions.create(
        model=model_name,
        messages=messages
    )
    elapsed_time = time.time() - start_time
    response = completion.choices[0].message.content
    print(f"Request {n + 1} took: {elapsed_time:.2f} s")
    return {
        "instruction": content,
        "input": task,
        "output": response
    }

async def baseline_model(client, tasks, model_name):
    res = []
    for task in tasks:
        # Fire 50 generations per task concurrently.
        num_gen = 50
        tasks_list = [fetch_completion(client, task, model_name, n) for n in range(num_gen)]
        for result in tqdm_asyncio.as_completed(tasks_list, desc=f'Processing task: {task[:20]}'):
            res.append(await result)
    return res

def write_results_to_file(results, file_path):
    os.makedirs(file_path, exist_ok=True)
    submit_path = os.path.join(file_path, "submit.json")
    with open(submit_path, 'w', encoding='utf-8') as f:
        for result in results:
            f.write(json.dumps(result, ensure_ascii=False) + "\n")

async def main():
    client = AsyncOpenAI(api_key='sk-wjkljhfdhajfhkadfkh',
                         base_url='http://0.0.0.0:23333/v1')
    model_cards = await client.models.list()
    model_name = model_cards.data[0].id
    res_novel = await baseline_model(client, tasks, model_name)
    return res_novel


# Run the async entry point.
all_start_time = time.time()
res = asyncio.run(main())
all_time = time.time() - all_start_time
write_results_to_file(res, file_path)
print(all_time)
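
One way to answer the "how can they be limited" part of the title is to cap in-flight requests with an asyncio.Semaphore, so the server never sees more than a fixed number of concurrent generations. A minimal sketch against the script above, where MAX_CONCURRENCY is a hypothetical tuning knob:

import asyncio

# Hypothetical cap on simultaneous requests; tune to server capacity.
MAX_CONCURRENCY = 8
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def fetch_completion_limited(client, task, model_name, n):
    # At most MAX_CONCURRENCY coroutines hold an open request at once;
    # the rest wait here instead of flooding the server.
    async with semaphore:
        return await fetch_completion(client, task, model_name, n)

Substituting fetch_completion_limited for fetch_completion in baseline_model keeps the tqdm loop unchanged while bounding the server-side queue.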

Environment

Running on an A100 on the InternStudio platform.

sys.platform: linux
Python: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.2
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.2
LMDeploy: 0.5.0+
transformers: 4.41.2
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.1.0

Error traceback

No response

Volta-lemon commented Jul 23 '24 13:07

The normal output rate is roughly 30 s per item, but the output was as follows:

Processing task: 90%|█████████ | 45/50 [00:25<00:05, 1.16s/it]
Request 10 took: 25.81 s
Processing task: 92%|█████████▏| 46/50 [00:32<00:09, 2.45s/it]
Request 4 took: 32.59 s
Processing task: 94%|█████████▍| 47/50 [01:03<00:27, 9.23s/it]
Request 46 took: 63.00 s
Processing task: 96%|█████████▌| 48/50 [03:29<01:29, 44.59s/it]
Request 13 took: 209.30 s
Processing task: 98%|█████████▊| 49/50 [10:13<02:21, 141.32s/it]
Request 28 took: 613.31 s
Processing task: 100%|██████████| 50/50 [10:16<00:00, 12.34s/it]
Request 24 took: 616.64 s
Processing task: 2%|▏ | 1/50 [00:03<02:47, 3.43s/it]
Request 42 took: 3.43 s

Volta-lemon commented Jul 23 '24 13:07

Total time this run: 3589 s, roughly 10x the previous run. I observed that most items finish in under 20 s, but a few overrun badly. I suspect a scheduling problem: with GPU utilization pinned at 100%, the server ends up in something like a deadlock.
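
One way to distinguish genuinely stuck requests from ones that are merely queued is to bound each call with asyncio.wait_for. A minimal diagnostic sketch against the reproduction script, with the 120 s timeout an arbitrary choice:

import asyncio

async def fetch_with_timeout(client, task, model_name, n, timeout_s=120.0):
    # Abandon (and log) a request instead of waiting on it forever,
    # which makes scheduler stalls visible in the client output.
    try:
        return await asyncio.wait_for(
            fetch_completion(client, task, model_name, n), timeout=timeout_s)
    except asyncio.TimeoutError:
        print(f"Request {n + 1} exceeded {timeout_s:.0f} s, giving up")
        return None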

Volta-lemon commented Jul 23 '24 13:07

Please tell us which model you are using. Also, try running a benchmark with the benchmark/profile_restful_api.py script. If that looks fine, the problem is likely either the model (e.g., stop tokens not being recognized) or the client script.
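
If missing stop tokens are the cause, one client-side check is to pass explicit stop strings through the OpenAI-compatible stop parameter. A minimal sketch, assuming the endpoint honors stop and that <|im_end|> is the end-of-turn marker the fine-tuned chat template actually emits:

completion = await client.chat.completions.create(
    model=model_name,
    messages=messages,
    # Hypothetical stop strings; replace with whatever the
    # fine-tuned template emits at the end of a turn.
    stop=["<|im_end|>"],
)

If generations then finish promptly, the server was likely generating until the context limit because the model's end token was never matched.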

AllentDan commented Jul 24 '24 03:07