[Bug] TurboMind crashes when the client passes an illegal top_k
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
Describe the bug
When the client passes an illegal top_k, such as -1, the TurboMind server crashes:
```
[WARNING] topk (-1) is larger than max supported number (1024) for token 0 clip to max supported number 1024.
Exception in thread Thread-136:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py", line 503, in _func
    output = self.model_insts[device_id].forward(
RuntimeError: [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/utils/allocator.h:231
```
We face this issue in both api_server and TIS (Triton Inference Server).
Reproduction
- Launch the api_server:

  ```shell
  lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port 23333 --tp 1
  ```
- Start the client and pass top_k=-1:

  ```python
  from lmdeploy.serve.openai.api_client import APIClient

  api_client = APIClient('http://0.0.0.0:23333')
  model_name = api_client.available_models[0]
  for item in api_client.chat_interactive_v1(model=model_name, prompt="hi", max_tokens=500, session_id=5, top_k=-1):
      print(item)
  ```
Environment
```
sys.platform: linux
Python: 3.9.16 (main, Aug 15 2023, 19:38:56) [GCC 8.3.1 20190311 (Red Hat 8.3.1-3)]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)
PyTorch: 2.1.2+cu118
TorchVision: 0.16.2+cu121
LMDeploy: 0.2.5+
transformers: 4.37.1
gradio: 3.50.2
fastapi: 0.104.1
pydantic: 2.6.0
```
Hi @lvhan028 @lzhangzz @irexyc, we cannot assume that users' request parameters are always within reasonable ranges. When unexpected input arrives, only that individual request should be invalidated, instead of the entire program becoming unusable. The current exception handling and error processing are very fragile, which poses significant risks for real online services.
In this scenario, suppose two requests arrive in sequence and the first one carries an illegal top_k such as -1. TurboMind hangs immediately, so the second, perfectly normal request never gets a response.
I discussed this with @ispobock: perhaps we can run a VerifySamplingParameters check in ProcessInferRequests so that only valid requests enter the subsequent Initialize and Forward stages, and explicitly notify users through the API Server response when a request fails because of invalid sampling parameters.
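For illustration only, here is a minimal sketch of the kind of check we have in mind, written in Python to match the reproduction snippet; the function name `verify_sampling_parameters` and the exact bounds are assumptions, and the real fix would live in TurboMind's C++ request-processing path:

```python
# Illustrative sketch only: names and bounds are assumptions, not existing
# TurboMind code. The real check would sit in the C++ engine (e.g. the
# ProcessInferRequests path) and/or in the serving layer.
MAX_TOP_K = 1024  # the limit mentioned in the warning log


def verify_sampling_parameters(top_k: int, top_p: float, temperature: float):
    """Return None if the parameters look valid, otherwise an error message."""
    if top_k < 0 or top_k > MAX_TOP_K:
        return f'top_k must be in [0, {MAX_TOP_K}], got {top_k}'
    if not 0.0 < top_p <= 1.0:
        return f'top_p must be in (0, 1], got {top_p}'
    if temperature <= 0.0:
        return f'temperature must be positive, got {temperature}'
    return None


# An invalid request would be rejected before Initialize/Forward and the
# error surfaced to the user instead of crashing the whole engine.
error = verify_sampling_parameters(top_k=-1, top_p=0.8, temperature=0.7)
if error is not None:
    print(f'reject request: {error}')  # e.g. return it as an HTTP 4xx response
```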
Can we do the check in turbomind.py and chatbot.py? I think it's much simpler.
Exception handling and error processing are something TurboMind should provide itself, although this can also be achieved by relying on checks in the outer layers.
If you are concerned about the complexity of modifying C++, we will create appropriate unit tests and conduct stability verification.
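As a sketch of the kind of stability check meant here, assuming the api_server from the reproduction section is running on port 23333; the behavior it asserts (reject the bad request, keep serving later ones) is the goal, not what happens today:

```python
from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient('http://0.0.0.0:23333')
model_name = api_client.available_models[0]


def generate(session_id, **sampling_kwargs):
    # Collect the streamed chunks of one interactive request.
    return list(
        api_client.chat_interactive_v1(model=model_name,
                                       prompt='hi',
                                       max_tokens=32,
                                       session_id=session_id,
                                       **sampling_kwargs))


# 1. An illegal top_k should invalidate only this request, not the engine.
print('illegal request ->', generate(session_id=1, top_k=-1))

# 2. A normal request issued afterwards must still be served.
assert generate(session_id=2, top_k=40), \
    'engine no longer responds after an illegal request'
```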
@lvhan028
- Even if we add the check at the server level, the kernel crash issue still needs to be fixed:

  ```
  RuntimeError: [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/utils/allocator.h:231
  ```

  I also tried illegal top_p and temperature values; they only print a warning and do not crash. The crash seems to happen only with an illegal top_k (a small comparison script is sketched after this comment).
- As @zhyncs mentioned, it seems better to also have a double check at the engine level to reject illegal requests.
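For reference, a small comparison script along these lines, reusing the client API from the reproduction section (the particular illegal values are arbitrary):

```python
# Probe which illegal sampling parameters only trigger a warning and
# which ones crash the engine.
from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient('http://0.0.0.0:23333')
model_name = api_client.available_models[0]

cases = ({'top_k': -1}, {'top_p': -0.5}, {'temperature': -1.0})
for i, kwargs in enumerate(cases):
    try:
        out = list(
            api_client.chat_interactive_v1(model=model_name,
                                           prompt='hi',
                                           max_tokens=32,
                                           session_id=100 + i,
                                           **kwargs))
        print(kwargs, '-> completed:', bool(out))
    except Exception as err:
        print(kwargs, '-> failed:', err)
```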
Hi @lvhan028, regarding this issue, if you do not have time to fix it at the moment, we plan to start working on a fix after the holiday (4.7).