KV cache pool leak detected when benchmarking llama-13B-AWQ on an A40
UserWarning: Warning: available_size=35244, max_total_num_token=42308 KV cache pool leak detected!
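For context, this warning is an end-of-run sanity check on the token pool: once every request has finished and released its slots, the number of available KV slots should equal the pool capacity again. A smaller available_size means some slots were never returned. Below is a minimal, hypothetical sketch of that kind of check; the names mirror the warning text but are illustrative, not sglang's actual internals.

```python
# Hypothetical sketch of the invariant behind the warning above.
# The real check lives in sglang's router (model_rpc.py); these names
# are illustrative only.
import warnings

def check_kv_pool(available_size: int, max_total_num_token: int) -> None:
    # With no requests running, every KV slot should be back in the pool.
    # Fewer available slots means some were allocated but never freed.
    if available_size != max_total_num_token:
        warnings.warn(
            f"Warning: available_size={available_size}, "
            f"max_total_num_token={max_total_num_token} "
            "KV cache pool leak detected!"
        )

check_kv_pool(35244, 42308)  # the numbers reported above
```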
Thanks for reporting this warning! Could you provide more details so we can reproduce it or find where the bug is?
The leaked KV cache is too large. I just reproduced the error on an A10 24G with llama-13b-hf AWQ. I noticed that during benchmarking, the backend logs report CUDA out-of-memory errors:
Exception in ModelRpcClient:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 165, in exposed_step
self.forward_step()
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 180, in forward_step
self.forward_fill_batch(new_batch)
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 369, in forward_fill_batch
logits, (logprobs, normalized_logprobs) = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 472, in forward
return self.forward_extend(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 377, in forward_extend
return self.model.forward(input_ids, input_metadata.positions, input_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 269, in forward
hidden_states = self.model(input_ids, positions, input_metadata, skip_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 239, in forward
hidden_states, residual = layer(
^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 199, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 60, in forward
gate_up, _ = self.gate_up_proj(x)
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 211, in forward
output_parallel = self.linear_method.apply_weights(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq.py", line 156, in apply_weights
out = ops.awq_gemm(reshaped_x, qweight, scales, qzeros, pack_factor)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.56 GiB. GPU 0 has a total capacty of 22.19 GiB of which 975.50 MiB is free. Including non-PyTorch memory, this process has 21.23 GiB memory in use. Of the allocated memory 19.75 GiB is allocated by PyTorch, and 1.18 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I think this is the cause. Could you check whether your backend logs also print CUDA out-of-memory errors during benchmarking?
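If that is the failure mode, the likely chain of events is: the OOM exception aborts the batch mid-forward, the KV slots already allocated for that batch are never released back to the pool, and the later sanity check then reports them as a leak. A practical workaround is to leave more GPU headroom for activations (e.g. the AWQ GEMM workspace) so the forward pass itself does not OOM, by reserving a smaller share of memory for the KV cache pool. A hedged sketch using the Python runtime API follows; mem_fraction_static is taken from sglang's server arguments (exposed as --mem-fraction-static on the launch_server CLI in the versions I'm aware of), but verify the exact name for your installed version, and the model path here is only illustrative.

```python
# Sketch of a workaround: reserve a smaller share of GPU memory for the
# KV cache pool so activations have more headroom during prefill.
# mem_fraction_static is assumed to be forwarded to sglang's ServerArgs;
# check the option name for your version.
import sglang as sgl

runtime = sgl.Runtime(
    model_path="TheBloke/Llama-2-13B-AWQ",  # illustrative AWQ checkpoint
    mem_fraction_static=0.8,                # lower values leave more room for activations
)
sgl.set_default_backend(runtime)
```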
I had a similar problem running on an EC2 g5.2xlarge instance (1 x A10G) using openchat/openchat3.5-0106. I have long sequences (6-7k tokens). A batch size of 19 sequences is fine, but increasing to 38 or more sequences produced this error for me too.
I used sglang v0.1.11, but also tested sglang v0.1.9 with the same result.
/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py:179: UserWarning: Warning: available_size=38896, max_total_num_token=49196
KV cache pool leak detected!
I did, however, get OOM errors first in other cases:
Exception in ModelRpcClient:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 140, in exposed_step
self.forward_step()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 155, in forward_step
self.forward_fill_batch(new_batch)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 344, in forward_fill_batch
logits, (logprobs, normalized_logprobs) = self.model_runner.forward(
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 472, in forward
return self.forward_extend(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 377, in forward_extend
return self.model.forward(input_ids, input_metadata.positions, input_metadata)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 270, in forward
return self.logits_processor(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 52, in forward
all_logprobs = torch.log(torch.softmax(logits.float(), dim=-1) + 1e-6)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 892.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 521.00 MiB is free. Including non-PyTorch memory, this process has 21.47 GiB memory in use. Of the allocated memory 20.93 GiB is allocated by PyTorch, and 242.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
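In this second trace the allocation that fails is the full-vocabulary log-probability tensor: torch.softmax(logits.float(), dim=-1) materializes a float32 tensor of shape (num_prefill_tokens, vocab_size), and the +1e-6 and torch.log steps need further tensors of the same size. With ~7k prompt tokens and a ~32k vocabulary, one such float32 copy is on the order of the 892 MiB allocation in the trace, so long prompts push peak memory up quickly. For reference, torch.log_softmax computes essentially the same quantity without the intermediate softmax tensor and without needing the epsilon for stability. The snippet below is an illustration of the difference, not a patch to sglang.

```python
# Illustration (not an sglang patch): log_softmax avoids materializing the
# intermediate softmax and epsilon tensors, reducing peak memory at this step.
import torch

# At the failing line, logits has shape (num_prefill_tokens, vocab_size);
# a tiny stand-in shape is used here so the example runs anywhere.
logits = torch.randn(8, 32_000).half()

# Pattern from the traceback: separate float32 tensors for the softmax
# output, the +1e-6 result, and the log output.
logprobs_a = torch.log(torch.softmax(logits.float(), dim=-1) + 1e-6)

# Fused alternative: a single float32 output, numerically stable
# (no -inf when a probability underflows to zero).
logprobs_b = torch.log_softmax(logits.float(), dim=-1)

# The two agree except where probabilities are near 1e-6 or below,
# where the epsilon in the first version dominates.
```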
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.