[Bug] TurboMind crashes when the client passes an illegal top_k
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
Describe the bug
When the client passes an illegal top_k, such as -1, the TurboMind server crashes:
```
[WARNING] topk (-1) is larger than max supported number (1024) for token 0 clip to max supported number 1024.
Exception in thread Thread-136:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py", line 503, in _func
    output = self.model_insts[device_id].forward(
RuntimeError: [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/utils/allocator.h:231
```
We face this issue in both api_server and TIS (Triton Inference Server).
Reproduction
- Launch the api_server:

  ```shell
  lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port 23333 --tp 1
  ```
- Start the client and pass top_k=-1:

  ```python
  from lmdeploy.serve.openai.api_client import APIClient

  api_client = APIClient('http://0.0.0.0:23333')
  model_name = api_client.available_models[0]
  for item in api_client.chat_interactive_v1(model=model_name, prompt="hi", max_tokens=500, session_id=5, top_k=-1):
      print(item)
  ```
Environment
```
sys.platform: linux
Python: 3.9.16 (main, Aug 15 2023, 19:38:56) [GCC 8.3.1 20190311 (Red Hat 8.3.1-3)]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)
PyTorch: 2.1.2+cu118
TorchVision: 0.16.2+cu121
LMDeploy: 0.2.5+
transformers: 4.37.1
gradio: 3.50.2
fastapi: 0.104.1
pydantic: 2.6.0
```
Hi @lvhan028 @lzhangzz @irexyc, we cannot assume that users' request parameters are always within reasonable ranges. When unexpected input arrives, only that individual request should be invalidated, instead of the entire program becoming unusable. The current exception handling and error processing are very fragile, which poses significant risks for real online services.
In this scenario, suppose two requests arrive in sequence and the first one carries an illegal top_k such as -1. TurboMind hangs immediately, so the second, perfectly normal request never gets a response.
I discussed this with @ispobock: perhaps we can run a VerifySamplingParameters check in ProcessInferRequests so that only valid requests enter the subsequent Initialize and Forward stages, and explicitly notify users through the API Server response when a request fails because of invalid sampling parameters.
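For illustration only, here is a minimal sketch of the kind of check we have in mind, written in Python to match the reproduction snippet; the function name `verify_sampling_parameters` and the exact bounds are assumptions, and the real fix would live in TurboMind's C++ request-processing path:

```python
# Illustrative sketch only: names and bounds are assumptions, not existing
# TurboMind code. The real check would sit in the C++ engine (e.g. the
# ProcessInferRequests path) and/or in the serving layer.
MAX_TOP_K = 1024  # the limit mentioned in the warning log


def verify_sampling_parameters(top_k: int, top_p: float, temperature: float):
    """Return None if the parameters look valid, otherwise an error message."""
    if top_k < 0 or top_k > MAX_TOP_K:
        return f'top_k must be in [0, {MAX_TOP_K}], got {top_k}'
    if not 0.0 < top_p <= 1.0:
        return f'top_p must be in (0, 1], got {top_p}'
    if temperature <= 0.0:
        return f'temperature must be positive, got {temperature}'
    return None


# An invalid request would be rejected before Initialize/Forward and the
# error surfaced to the user instead of crashing the whole engine.
error = verify_sampling_parameters(top_k=-1, top_p=0.8, temperature=0.7)
if error is not None:
    print(f'reject request: {error}')  # e.g. return it as an HTTP 4xx response
```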
Can we do the check in turbomind.py and chatbot.py? I think it's much simpler.
Exception handling and error processing are something TurboMind should provide itself, although this can also be achieved by relying on checks in the outer layers.
If you are concerned about the complexity of modifying C++, we will create appropriate unit tests and conduct stability verification.
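As a sketch of the kind of stability check meant here, assuming the api_server from the reproduction section is running on port 23333; the behavior it asserts (reject the bad request, keep serving later ones) is the goal, not what happens today:

```python
from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient('http://0.0.0.0:23333')
model_name = api_client.available_models[0]


def generate(session_id, **sampling_kwargs):
    # Collect the streamed chunks of one interactive request.
    return list(
        api_client.chat_interactive_v1(model=model_name,
                                       prompt='hi',
                                       max_tokens=32,
                                       session_id=session_id,
                                       **sampling_kwargs))


# 1. An illegal top_k should invalidate only this request, not the engine.
print('illegal request ->', generate(session_id=1, top_k=-1))

# 2. A normal request issued afterwards must still be served.
assert generate(session_id=2, top_k=40), \
    'engine no longer responds after an illegal request'
```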
@lvhan028
- Even if we add the check at the server level, the kernel crash issue still needs to be fixed:

  ```
  RuntimeError: [TM][ERROR] CUDA runtime error: an illegal memory access was encountered /lmdeploy/src/turbomind/utils/allocator.h:231
  ```

  I also tried illegal top_p and temperature values; they only print a warning and do not crash. The crash seems to happen only with an illegal top_k (a small comparison script is sketched after this comment).
- As @zhyncs mentioned, it seems better to also have a double check at the engine level to reject illegal requests.
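For reference, a small comparison script along these lines, reusing the client API from the reproduction section (the particular illegal values are arbitrary):

```python
# Probe which illegal sampling parameters only trigger a warning and
# which ones crash the engine.
from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient('http://0.0.0.0:23333')
model_name = api_client.available_models[0]

cases = ({'top_k': -1}, {'top_p': -0.5}, {'temperature': -1.0})
for i, kwargs in enumerate(cases):
    try:
        out = list(
            api_client.chat_interactive_v1(model=model_name,
                                           prompt='hi',
                                           max_tokens=32,
                                           session_id=100 + i,
                                           **kwargs))
        print(kwargs, '-> completed:', bool(out))
    except Exception as err:
        print(kwargs, '-> failed:', err)
```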
Hi @lvhan028, regarding this issue, if you do not have time to fix it at the moment, we plan to start working on a fix after the holiday (4.7).