KV cache pool leak detected when benchmarking llama-13B-AWQ on an A40
UserWarning: Warning: available_size=35244, max_total_num_token=42308 KV cache pool leak detected!
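For context, this warning is an end-of-run sanity check on the token pool: once every request has finished and released its slots, the number of available KV slots should equal the pool capacity again. A smaller available_size means some slots were never returned. Below is a minimal, hypothetical sketch of that kind of check; the names mirror the warning text but are illustrative, not sglang's actual internals.

```python
# Hypothetical sketch of the invariant behind the warning above.
# The real check lives in sglang's router (model_rpc.py); these names
# are illustrative only.
import warnings

def check_kv_pool(available_size: int, max_total_num_token: int) -> None:
    # With no requests running, every KV slot should be back in the pool.
    # Fewer available slots means some were allocated but never freed.
    if available_size != max_total_num_token:
        warnings.warn(
            f"Warning: available_size={available_size}, "
            f"max_total_num_token={max_total_num_token} "
            "KV cache pool leak detected!"
        )

check_kv_pool(35244, 42308)  # the numbers reported above
```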
Thanks for reporting this warning! Could you provide more details so we can reproduce it or find where the bug is?
The leaked KV cache is too large. I just reproduced the error on an A10 24G with llama-13b-hf AWQ. I noticed that during benchmarking, the backend logs report CUDA out-of-memory errors:
Exception in ModelRpcClient:
Traceback (most recent call last):
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 165, in exposed_step
self.forward_step()
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 180, in forward_step
self.forward_fill_batch(new_batch)
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 369, in forward_fill_batch
logits, (logprobs, normalized_logprobs) = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 472, in forward
return self.forward_extend(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 377, in forward_extend
return self.model.forward(input_ids, input_metadata.positions, input_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 269, in forward
hidden_states = self.model(input_ids, positions, input_metadata, skip_embed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 239, in forward
hidden_states, residual = layer(
^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 199, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang/python/sglang/srt/models/llama2.py", line 60, in forward
gate_up, _ = self.gate_up_proj(x)
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 211, in forward
output_parallel = self.linear_method.apply_weights(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/sglang-venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq.py", line 156, in apply_weights
out = ops.awq_gemm(reshaped_x, qweight, scales, qzeros, pack_factor)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.56 GiB. GPU 0 has a total capacty of 22.19 GiB of which 975.50 MiB is free. Including non-PyTorch memory, this process has 21.23 GiB memory in use. Of the allocated memory 19.75 GiB is allocated by PyTorch, and 1.18 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I think this is the cause. Could you check whether your backend logs also print CUDA out-of-memory errors during benchmarking?
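If that is the failure mode, the likely chain of events is: the OOM exception aborts the batch mid-forward, the KV slots already allocated for that batch are never released back to the pool, and the later sanity check then reports them as a leak. A practical workaround is to leave more GPU headroom for activations (e.g. the AWQ GEMM workspace) so the forward pass itself does not OOM, by reserving a smaller share of memory for the KV cache pool. A hedged sketch using the Python runtime API follows; mem_fraction_static is taken from sglang's server arguments (exposed as --mem-fraction-static on the launch_server CLI in the versions I'm aware of), but verify the exact name for your installed version, and the model path here is only illustrative.

```python
# Sketch of a workaround: reserve a smaller share of GPU memory for the
# KV cache pool so activations have more headroom during prefill.
# mem_fraction_static is assumed to be forwarded to sglang's ServerArgs;
# check the option name for your version.
import sglang as sgl

runtime = sgl.Runtime(
    model_path="TheBloke/Llama-2-13B-AWQ",  # illustrative AWQ checkpoint
    mem_fraction_static=0.8,                # lower values leave more room for activations
)
sgl.set_default_backend(runtime)
```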
I had a similar problem running on an EC2 g5.2xlarge instance (1 x A10G) using openchat/openchat3.5-0106. I have long sequences (6-7k tokens). A batch size of 19 sequences is fine, but increasing to 38 or more sequences produced this error for me too.
I used sglang v0.1.11, but also tested sglang v0.1.9 with the same result.
/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py:179: UserWarning: Warning: available_size=38896, max_total_num_token=49196
KV cache pool leak detected!
I did, however, get OOM errors first in other cases:
Exception in ModelRpcClient:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 140, in exposed_step
self.forward_step()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 155, in forward_step
self.forward_fill_batch(new_batch)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 344, in forward_fill_batch
logits, (logprobs, normalized_logprobs) = self.model_runner.forward(
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 472, in forward
return self.forward_extend(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 377, in forward_extend
return self.model.forward(input_ids, input_metadata.positions, input_metadata)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 270, in forward
return self.logits_processor(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 52, in forward
all_logprobs = torch.log(torch.softmax(logits.float(), dim=-1) + 1e-6)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 892.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 521.00 MiB is free. Including non-PyTorch memory, this process has 21.47 GiB memory in use. Of the allocated memory 20.93 GiB is allocated by PyTorch, and 242.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
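In this second trace the allocation that fails is the full-vocabulary log-probability tensor: torch.softmax(logits.float(), dim=-1) materializes a float32 tensor of shape (num_prefill_tokens, vocab_size), and the +1e-6 and torch.log steps need further tensors of the same size. With ~7k prompt tokens and a ~32k vocabulary, one such float32 copy is on the order of the 892 MiB allocation in the trace, so long prompts push peak memory up quickly. For reference, torch.log_softmax computes essentially the same quantity without the intermediate softmax tensor and without needing the epsilon for stability. The snippet below is an illustration of the difference, not a patch to sglang.

```python
# Illustration (not an sglang patch): log_softmax avoids materializing the
# intermediate softmax and epsilon tensors, reducing peak memory at this step.
import torch

# At the failing line, logits has shape (num_prefill_tokens, vocab_size);
# a tiny stand-in shape is used here so the example runs anywhere.
logits = torch.randn(8, 32_000).half()

# Pattern from the traceback: separate float32 tensors for the softmax
# output, the +1e-6 result, and the log output.
logprobs_a = torch.log(torch.softmax(logits.float(), dim=-1) + 1e-6)

# Fused alternative: a single float32 output, numerically stable
# (no -inf when a probability underflows to zero).
logprobs_b = torch.log_softmax(logits.float(), dim=-1)

# The two agree except where probabilities are near 1e-6 or below,
# where the epsilon in the first version dominates.
```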
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.