[Hardware][Nvidia][Core][Feature] New feature: VMM (virtual memory management) KV cache for NVIDIA GPUs
[New Feature] VMM (virtual memory management) KV cache for NVIDIA GPUs
This PR addresses feature request #4675 and draws on the vAttention paper.
Description:
This PR addresses the inefficiencies of the manual cache page management scheme in vLLM by leveraging CUDA's Virtual Memory Management (VMM) API introduced in CUDA 10.2. By replacing the manual page management with CUDA driver’s native virtual memory solution, we aim to enhance performance and GPU memory utilization.
Motivation:
The current vLLM approach pre-allocates a large number of cache tensor blocks, which results in several issues:
- Complex manual block table management: managing the block table manually is cumbersome and introduces additional overhead.
- Performance overhead in attention kernels: the attention kernel must adapt to block-based cache management, which incurs performance penalties.
- Rigid GPU memory utilization: the pre-allocation strategy lacks flexibility, leading to inefficient GPU memory use and potential waste.
Solution:
Utilizing CUDA's VMM API, we propose an improved cache management scheme (a conceptual sketch follows the list below):
- Contiguous virtual cache tensor allocation: for each sequence, we allocate a contiguous virtual tensor sized for the maximum sequence length, without pre-allocating physical GPU memory.
- On-demand physical memory allocation: physical memory is dynamically allocated and mapped as needed during inference, rather than upfront.
- Offloading management to the GPU driver: memory management and mapping are handled by the GPU driver, enhancing flexibility and potentially improving performance.
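To make the scheme concrete, here is a minimal, runnable Python sketch of the idea. It is an illustration only, not code from this PR: the two helper functions are stand-ins for the actual CUDA driver calls (cuMemAddressReserve to reserve virtual address space, and cuMemCreate / cuMemMap / cuMemSetAccess to back it with physical memory), which in the real implementation live in the C++/CUDA layer.

# Conceptual sketch only -- the stubs below simulate the CUDA driver calls so
# the on-demand allocation logic can be shown without a GPU.
GRANULARITY = 2 * 1024 * 1024  # NVIDIA VMM minimum physical allocation granularity (2 MB)

def reserve_virtual_range(num_bytes: int) -> int:
    """Stand-in for cuMemAddressReserve: reserve virtual space, commit no physical memory."""
    print(f"reserved {num_bytes} bytes of virtual address space")
    return 0  # pretend base address

def map_physical_block(addr: int, num_bytes: int) -> None:
    """Stand-in for cuMemCreate + cuMemMap + cuMemSetAccess on one granule."""
    print(f"mapped {num_bytes} physical bytes at address {addr}")

class VmmKVCache:
    """Per-sequence KV cache whose physical backing grows on demand."""

    def __init__(self, max_seq_bytes: int) -> None:
        # One contiguous virtual tensor sized for the maximum sequence length.
        self.base = reserve_virtual_range(max_seq_bytes)
        self.mapped_bytes = 0

    def ensure_capacity(self, needed_bytes: int) -> None:
        # Map physical memory lazily, one granule at a time, as the sequence grows.
        while self.mapped_bytes < needed_bytes:
            map_physical_block(self.base + self.mapped_bytes, GRANULARITY)
            self.mapped_bytes += GRANULARITY

# A sequence that currently needs 5 MB of KV cache ends up with three 2 MB granules mapped.
cache = VmmKVCache(max_seq_bytes=4 * 1024 * 1024 * 1024)
cache.ensure_capacity(5 * 1024 * 1024)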
Benefits:
- Simplified cache management: eliminates the need for manual block table management.
- Performance improvement: reduces overhead in the attention kernel thanks to simplified memory management; preliminary results show increased performance and more flexible GPU memory usage.
- Enhanced GPU memory utilization: dynamic allocation reduces memory waste and improves utilization.
Compatibility and Usage:
- This feature is added as an optional switch and is fully compatible with existing code.
- To enable VMM-based management, simply pass the use_vmm=True flag when constructing the LLM class (see the usage sketch below).
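For example, a minimal usage sketch (the model name here is just a placeholder):

from vllm import LLM, SamplingParams

# use_vmm=True switches the KV cache to the VMM-based manager added in this PR.
llm = LLM(model="facebook/opt-125m", use_vmm=True)  # placeholder model name
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)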
Current Status and Limitations:
The current version is preliminary and still under active development. Known limitations include:
- Currently only NVIDIA GPUs and the flash-attn backend are supported.
- Some existing features, such as prefix caching and chunked prefill, are not yet supported.
- The NVIDIA VMM minimum allocation granularity is 2MB, which is too large in some cases and can lead to significant in-block fragmentation (a small illustration follows this list).
- ...
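A toy illustration of that in-block fragmentation (the per-token KV footprint below is hypothetical; real values depend on the model):

# Toy numbers only: how much of a 2 MB granule a short sequence can leave unused.
GRANULE_BYTES = 2 * 1024 * 1024
kv_bytes_per_token = 40 * 1024                             # hypothetical per-token KV footprint (40 KB)

tokens_per_granule = GRANULE_BYTES // kv_bytes_per_token   # 51 tokens fit in one granule
seq_len = 5                                                # a short sequence
wasted = GRANULE_BYTES - seq_len * kv_bytes_per_token
print(f"{wasted / GRANULE_BYTES:.0%} of the granule is unused")   # ~90%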
We are working hard to improve it and welcome everyone to join us.
Testing:
We tested qwen2-72b with tp=4 and qwen-7b with tp=1 on H20 GPUs. After enabling VMM:
- End-to-end inference throughput changed by -2% to +20% tokens/s.
- Cache memory is allocated as needed, significantly reducing GPU memory usage for small batches or short output lengths.
Conclusion:
This PR introduces an optimized cache management scheme using CUDA's Virtual Memory Management API. The changes enhance GPU memory utilization, simplify cache management, and provide a pathway for further performance improvements.
Thank you for reviewing this PR. I look forward to your feedback.
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.
Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- We adhere to Google Python style guide and Google C++ style guide.
- Pass all linter checks. Please use format.sh to format your code.
- The code needs to be well-documented to ensure future contributors can easily understand the code.
- Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
- Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. This helps vLLM users understand and utilize the new features or changes.
Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.
What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
- After the PR is assigned, the reviewer will provide status updates every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
- After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
- Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.
Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!
@WoosukKwon, could you please take a look at this PR (#6102) when you have a moment? I would appreciate your feedback and a review.
Have you tested the improvement on A100/A800?
We are testing it and will try to share results later.
A100 qwen2-72b results (bar chart attached): @Qiubo1
From the source code, you did not hide the memory allocation latency; you just allocate the memory before execution:
if self.use_vmm:
    execute_model_req.allocated_block_counts = scheduler_outputs.allocated_block_counts
output = self.model_executor.execute_model(execute_model_req=execute_model_req)
Is there any way to hide the memory allocation?
- Physical memory is dynamically allocated and mapped as needed during inference, rather than upfront.
@Ronaldo9RR Thank you for your feedback.
"Physical memory is dynamically allocated and mapped as needed during inference, rather than upfront." This may not be accurate, sorry for the confusion. In response, what I meant was that physical memory doesn't have to be allocated a large amount of fixed memory in advance (during the init phase), but can be dynamically allocated as needed, which allows for more flexibility in memory usage.
More specifically, the purpose of this work is not to bring much improvement to the inference speed, but for the flexible use of gpu memory, optimising vllm in terms of wastefulness for a large amount of fixed occupied gpu memory. In addition, based on the fact that vllm also has a cache tensor that is a whole contiguous space, it might be more friendly to optimise the implementation of the attn operator, without having to take into account the impact of block table/manual page management!
As you can see, it's just a preliminary version at the moment, and since the overhead of the cuda vmm api to allocate physical gpu memory is small, overlap of memory allocation and computation has not been implemented yet, and we're in the process of developing it. But overlap is not a problem, the memory usage is only related to the context length, if we need to allocate an additional physical block in round t, we can use a background thread to pre-allocate it in round t-1, and overlap with the computation in round t-1 (as mentioned in the vAttention paper).
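A rough sketch of that overlap idea (hypothetical helper, not code from this PR): while the GPU runs one decode round, a background thread performs the cuMemCreate/cuMemMap work for the block the next round will need, keeping the driver calls off the critical path.

import threading

def preallocate_next_block(seq_cache: dict) -> None:
    # Stand-in for creating and mapping the physical block needed by the next round.
    seq_cache["mapped_blocks"] += 1

def decode_round(seq_cache: dict, t: int) -> None:
    # Start allocating the next round's block while round t computes.
    prealloc = threading.Thread(target=preallocate_next_block, args=(seq_cache,))
    prealloc.start()
    # ... run the forward pass for round t on the GPU ...
    prealloc.join()  # the block must be mapped before the next round writes its KV entries

seq_cache = {"mapped_blocks": 1}
for t in range(3):
    decode_round(seq_cache, t)
print(seq_cache["mapped_blocks"])  # 4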
The vAttention paper assumes that each decode-phase iteration takes 10~100 ms, while cuMemMap / cuMemSetAccess take 2 / 38 ms respectively, so the overlap of memory allocation and computation needs to be carefully crafted. I also agree that the overlap is not a big deal, but it is important for LLM inference.
It is indeed as you said. Thank you very much for your suggestion, we will try to implement it soon!
Hi @izhuhaoran I am trying to reproduce the result here. I tried to check out your branch and run pip install -e . (I do not have access to docker to run the docker build) on A100. I ran into the following error:
[rank0]: AttributeError: '_OpNamespace' '_C_cache_ops' object has no attribute 'reshape_and_cache_flash'
Any suggestion here? To reproduce the figure you mentioned earlier in this PR, is there a script within this repo to reproduce it? I notice that the CI has a script running a benchmark, but that does not seem to be the one producing the figure you have in this PR. Thank you in advance.
Since I added some C++/CUDA code in csrc, you should build the source code first. You can try the following:
python setup.py install
or
python setup.py build_ext --inplace
export PYTHONPATH=<Your vllm dir>:$PYTHONPATH
@WoosukKwon @youkaichao @mgoin @comaniac, just a gentle reminder about this PR (#6102). Sorry to bother you all, but I am eager to receive your comments. Your insights and suggestions are highly valued, and I'm happy to incorporate any feedback to improve this PR. Please let me know if there's a more convenient time for you to take a look, or if there are any specific areas you'd like me to address before proceeding with the review.
I tried both of the suggestions here but the problem is still the same. It looks like the build_ext command above does not trigger the build for cache_ops.impl("reshape_and_cache_flash", torch::kCUDA, &reshape_and_cache_flash);.
The problem should be reproducible using :
pytest test_cache.py::test_reshape_and_cache_flash
Unfortunately, I ran pytest test_cache.py::test_reshape_and_cache_flash and don't get the error you mentioned.
Also, reshape_and_cache_flash is a kernel that comes with vLLM itself, and my additions don't affect it, so make sure the build completes properly, i.e. python setup.py build_ext --inplace and export PYTHONPATH=<Your vllm dir>:$PYTHONPATH run correctly, and then execute your test file within the current terminal. Hope this helps.
Thank you @izhuhaoran. Would you like to share the script needed to reproduce the experimental result along with the bar chart?
Thanks for the contribution and this looks exciting! Some thoughts/questions:
- Have you benchmarked on other GPUs, such as H100, A10g and L4?
- Why does it only work with flash attention (but not xFormers or FlashInfer)? Are there any kernel implementation requirements to make use of VMM, or has it just not been tested yet?
- It seems to me that to enable this feature we need to introduce new components such as a specialized block manager. If so, we should also consider architecture extensibility. For example, we should make the block manager more modularized, like the worker and model runner.
- Echoing @chakpongchung's comment, it would be good to provide a reproducible script so that community members can evaluate the PR locally. I believe this will facilitate the upstream process.
Thank you @izhuhaoran. Would you like to share the script needed to reproduce the experimental result along with the bar chart?
Here is my test script: test_llm.py.txt
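For anyone who can't access the attachment, a script along these lines (illustrative only, not the attached test_llm.py.txt; the model name and prompt are placeholders) is roughly what such an offline throughput test looks like:

import time
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 256                 # placeholder prompts
params = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="facebook/opt-125m", use_vmm=True)    # placeholder model

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")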
Thank you for your comments!
- I haven't tested it on those GPUs yet, as I don't have access to them right now.
- VMM has no additional requirements for the kernel; the attention kernel is not aware of VMM, since we just do the underlying memory management for a whole non-paged cache tensor. Backends like FlashInfer simply haven't been adapted yet.
Hi, I tried this PR, and it fails with the error below.
I ran it as an LLM server, have you tried this setup?
Here is my script:
python3 -u -m vllm.entrypoints.openai.api_server \
--port ${port} \
--model ${model} \
--dtype auto \
-tp ${tp} \
--max-model-len 4096 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.95 \
--disable-log-stats \
--disable-log-requests \
--enable-prefix-caching \
RuntimeError: CUDA error: an illegal memory access was encountered
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/vllm/vllm/model_executor/models/llama.py", line 306, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/vllm/vllm/model_executor/models/llama.py", line 230, in forward
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/vllm/vllm/model_executor/models/llama.py", line 164, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/vllm/vllm/attention/layer.py", line 94, in forward
return self.impl.forward(query, key, value, kv_cache, attn_metadata,
File "/opt/vllm/vllm/attention/backends/flash_attn.py", line 373, in forward
out = flash_attn_varlen_func(
File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
return FlashAttnVarlenFunc.apply(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
@izhuhaoran
@izhuhaoran
What does block_byte_size mean exactly? Is it similar to the block_size in vLLM?
If we don't modify the CUDA driver, the block_byte_size is 2MB; I tried this on my H100.
I tested it using a single H100 with vicuna 13B, seq_len = 710, and average output length = 190.
Using baseline vLLM I can run 96 batches at most, but using VMM the max concurrency is only 40+. Is that because the block_byte_size is too large?
Since 'CUDA kernel errors might be asynchronously reported', can you pass CUDA_LAUNCH_BLOCKING=1 and re-run to figure out where the error occurs?
Yes, block_byte_size is similar to base vLLM's block_size; it controls the size of one cache block. This CUDA error is probably due to the prefix cache, which is not well supported in my PR for now; I'm working on refining it.
I can't get much info using CUDA_LAUNCH_BLOCKING=1
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/engine/async_llm_engine.py", line 247, in step_async
ERROR 08-16 04:15:24 async_llm_engine.py:53] output = await self.model_executor.execute_model_async(
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/executor/gpu_executor.py", line 122, in execute_model_async
ERROR 08-16 04:15:24 async_llm_engine.py:53] output = await make_async(self.driver_worker.execute_model
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 08-16 04:15:24 async_llm_engine.py:53] result = self.fn(*self.args, **self.kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/worker/worker_base.py", line 282, in execute_model
ERROR 08-16 04:15:24 async_llm_engine.py:53] output = self.model_runner.execute_model(
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-16 04:15:24 async_llm_engine.py:53] return func(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/worker/model_runner.py", line 1288, in execute_model
ERROR 08-16 04:15:24 async_llm_engine.py:53] hidden_or_intermediate_states = model_executable(
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return self._call_impl(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return forward_call(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/model_executor/models/llama.py", line 402, in forward
ERROR 08-16 04:15:24 async_llm_engine.py:53] model_output = self.model(input_ids, positions, kv_caches,
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return self._call_impl(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return forward_call(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/model_executor/models/llama.py", line 307, in forward
ERROR 08-16 04:15:24 async_llm_engine.py:53] hidden_states, residual = layer(
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return self._call_impl(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return forward_call(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/model_executor/models/llama.py", line 231, in forward
ERROR 08-16 04:15:24 async_llm_engine.py:53] hidden_states = self.self_attn(
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return self._call_impl(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return forward_call(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/model_executor/models/llama.py", line 165, in forward
ERROR 08-16 04:15:24 async_llm_engine.py:53] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return self._call_impl(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-16 04:15:24 async_llm_engine.py:53] return forward_call(*args, **kwargs)
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/attention/layer.py", line 99, in forward
ERROR 08-16 04:15:24 async_llm_engine.py:53] ret = self.impl.forward(query, key, value, kv_cache, attn_metadata,
ERROR 08-16 04:15:24 async_llm_engine.py:53] File "/opt/vllm/vllm/attention/backends/flash_attn.py", line 355, in forward
ERROR 08-16 04:15:24 async_llm_engine.py:53] output = torch.empty_like(query)
ERROR 08-16 04:15:24 async_llm_engine.py:53] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 08-16 04:15:24 async_llm_engine.py:53] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 08-16 04:15:24 async_llm_engine.py:53]
I disabled prefix caching, the issue still exists.
As you say, there's not much info now, and it's hard to see what's wrong. Judging by the error occurring at flash_attn.py's output = torch.empty_like(query), it's very likely that there's a problem with the cache write of ops.reshape_and_cache_vmm.
Thanks for your feedback, I will try to reproduce your error and fix it; I'm also looking forward to your further findings on this issue. BTW, does this issue still exist if you use offline inference?
I had encountered a similar error (unrelated to this PR). Does this problem only occur when tp > 1? Will it happen when tp = 1? If it only occurs when tp > 1, I suggest checking whether all the inputs (q, k, v, kv_cache, attn_metadata) are on the same device.
Hi, let me make it clearer.
- If I run the OpenAI API server, it crashes with an illegal memory access.
- If I run offline inference, it works, but the max concurrency is much lower than before.
I wonder how it allocates memory at the very beginning.
For example, I run vicuna 13B with 40 layers and hidden dim = 5120.
So, at first, it will allocate a block for each batch and each layer, right?
The memory needed is 40 (layer_num) * 200 (max batch) * 2 (for K and V) = 16000 blocks?
And the memory is 16k * 2MB = 16GB?
Is that the case?
I use tp = 1, it still occurred.
I see the code here
single_token_bytes_size = head_size * num_heads * dtype_size
# We can divide a block equally among all layers, which reduces
# the number of vmm memory operations.
single_token_bytes_size *= num_layers
# vmm only support flash-attn now, which need block_size % 16 == 0
min_block_size = single_token_bytes_size * 16
self.block_bytes_size = math.lcm(  # type: ignore[attr-defined]
    self.block_bytes_size, min_block_size)
self.block_size = self.block_bytes_size // single_token_bytes_size
In my case, the page size is 50MB and the block size is 128:
INFO 08-16 06:39:03 arg_utils.py:707] use vmm 50MB block size: 128
Is that the reason why the concurrency is much lower?
I set block_size = 128 in pure vLLM without VMM and the max concurrency can still be up to 100 sequences, but with the VMM version the max concurrency is no more than 50.
I think that's weird.
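For reference, a quick back-of-the-envelope check reproduces those numbers, assuming vicuna-13B dimensions (40 layers, 40 heads of size 128, fp16 KV entries) and a 2MB starting block_bytes_size; these assumptions are mine and are not confirmed in this thread:

import math

head_size, num_heads, num_layers, dtype_size = 128, 40, 40, 2   # assumed vicuna-13B, fp16

single_token_bytes = head_size * num_heads * dtype_size * num_layers   # 409,600 bytes/token
min_block_bytes = single_token_bytes * 16                  # flash-attn needs block_size % 16 == 0
block_bytes = math.lcm(2 * 1024 * 1024, min_block_bytes)   # lcm with the 2 MB VMM granularity

print(block_bytes // (1024 * 1024), "MB per block")           # 50 MB
print(block_bytes // single_token_bytes, "tokens per block")  # 128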
When running offline inference, there are still crash cases.
I ran vicuna 13B on an H100 with an average input sequence length of 710.
Processed prompts: 2%|▎ | 300/12000 [01:12<42:43, 4.56it/s, est. speed input: 2967.39 toks/s, output: 1066.54 toks/s]WARNING 08-16 07:14:59 scheduler.py:1141] Sequence group 384 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
[rank0]: Traceback (most recent call last):
[rank0]: File "/data//vllm/vattn/offline.py", line 89, in <module>
[rank0]: test()
[rank0]: File "/data//vllm/vattn/offline.py", line 87, in test
[rank0]: test_llm(model, n, max_tokens, use_vmm, tp_size)
[rank0]: File "/data//vllm/vattn/offline.py", line 52, in test_llm
[rank0]: outputs = llm.generate(prompts_choose, sampling_params)
[rank0]: File "/opt/vllm/vllm/utils.py", line 822, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/opt/vllm/vllm/entrypoints/llm.py", line 309, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: File "/opt/vllm/vllm/entrypoints/llm.py", line 561, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: File "/opt/vllm/vllm/engine/llm_engine.py", line 852, in step
[rank0]: 0].schedule()
[rank0]: File "/opt/vllm/vllm/core/scheduler.py", line 987, in schedule
[rank0]: scheduler_outputs = self._schedule()
[rank0]: File "/opt/vllm/vllm/core/scheduler.py", line 962, in _schedule
[rank0]: return self._schedule_default()
[rank0]: File "/opt/vllm/vllm/core/scheduler.py", line 800, in _schedule_default
[rank0]: remaining_waiting, prefills = self._schedule_prefills(
[rank0]: File "/opt/vllm/vllm/core/scheduler.py", line 714, in _schedule_prefills
[rank0]: can_allocate = self.block_manager.can_allocate(seq_group)
[rank0]: File "/opt/vllm/vllm/core/block_manager_vmm.py", line 125, in can_allocate
[rank0]: assert seq.cache_buffer_id == -1
[rank0]: AssertionError