
[Bug]: Simultaneous mm calls lead to permanently degraded performance.

Open SeanIsYoung opened this issue 1 year ago • 18 comments

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64) - WSL
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080 SUPER
Nvidia driver version: 555.99
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             28
On-line CPU(s) list:                0-27
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) i7-14700KF
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 14
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           6835.20
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          672 KiB (14 instances)
L1i cache:                          448 KiB (14 instances)
L2 cache:                           28 MiB (14 instances)
L3 cache:                           33 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.1
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.77                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.45.1                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.dev238+ge2c6e0a82
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Model Input Dumps

No response

🐛 Describe the bug

I don't know whether this is user error or a bug, but I'm doing multimodal inference and noticed that I could have multiple instances of my script making calls to the model, with each getting the same throughput, leading to a higher average generation throughput. Having noticed this, I modified my script to make it asynchronous. But when I run it, the average generation throughput is now worse than if I were just running one instance. And when I go back to running just a single synchronous instance, I now get even worse throughput. The only way to return it to its original performance is to restart the model. I have, however, noticed that if the calls are still asynchronous but have a staggered delay, I have no problems.

To clarify with rough numbers:

| Setup | Avg gen throughput (tokens/s) |
| --- | --- |
| A single synchronous instance | ~50 |
| 3 manually started synchronous instances | ~150 |
| 1 asynchronous instance making 3 calls at a time | ~45 |
| A single synchronous instance after running async | ~15 |
| Restarted server | N/A |
| 1 asynchronous instance making 3 staggered calls | ~150 |

Looking at these numbers, it appears that something about the way I'm running the asynchronous version hits some sort of "resource limit", so the amount of resources each call is assigned gets capped (15 * 3 = 45). But when running again with a smaller load, this "call limit" is still in place. Maybe?

Possibly supporting this theory is the fact that when I run the staggered version, the VRAM usage doesn't get as high. But I don't know why the VRAM usage is lower, because in the end it's still running 3 requests at once. And I don't have enough headroom to truly test it.

You can reproduce this with the code below. Just update the OpenAI connection to point at your own model. You can achieve each of the different behaviours by adjusting the num_parallel and delay parameters.

Serving Command:

vllm serve $MODELS/openbmb/MiniCPM-V-2_6 --enforce-eager --max-model-len 4096 --max-seq-len 2048 --served-model-name hello --trust-remote-code --limit-mm-per-prompt image=2 --gpu-memory-utilization 0.95 --quantization fp8

Code:

import asyncio

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage
from langchain_core.runnables import Runnable

from langchain_openai import ChatOpenAI

def get_transcribe_chain() -> Runnable:
    chat = ChatOpenAI(model="hello", base_url="http://localhost:8000/v1", api_key="EMPTY", temperature=0.0)
    prompt_template = ChatPromptTemplate.from_messages([
        SystemMessage(content="You are an expert long format writer."),
        ( "user", [ 
            { "type": "image_url", "image_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } },
            { "type": "text", "text": "Write me a really really long and beautiful description of the image, it should be a least 4000 words." }
        ]),
    ])
    return prompt_template | chat

async def _transcribe_page( semaphore: asyncio.Semaphore, transcribe_chain: Runnable, delay: float) -> None:
    async with semaphore:
        # VLLM seems to complain if you hit it with too many at once.
        # So each of the pages is staggered.
        await asyncio.sleep(delay)
        await transcribe_chain.ainvoke(input={"input": "hi"})

async def _transcribe_pages_async(transcribe_chain: Runnable, num_parallel: int, delay: float) -> None:
    semaphore = asyncio.Semaphore(num_parallel)

    tasks = []
    for page_number in range(1, 101):
        tasks.append(_transcribe_page(
            semaphore=semaphore,
            transcribe_chain=transcribe_chain,
            delay=delay * (page_number % num_parallel),  # stagger start times within each group of num_parallel requests
        ))

    await asyncio.gather(*tasks)


def transcribe_pages(transcribe_chain: Runnable, num_parallel: int = 2, delay: float = 2) -> None:
    try:
        loop = asyncio.get_event_loop()
        if loop.is_closed():
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
    except RuntimeError:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)

    loop.run_until_complete(_transcribe_pages_async(transcribe_chain, num_parallel, delay))


if __name__ == "__main__":
    # Set this to 1 for a synchronous solution and higher for async.
    num_parallel = 3

    # Set this to 0 for true async or provide a float for n seconds staggering.
    delay = 2

    while True:
        transcribe_pages(get_transcribe_chain(), num_parallel, delay)

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

SeanIsYoung avatar Oct 11 '24 10:10 SeanIsYoung

Thanks for reporting this! Can you profile the code and see where the bottleneck is occurring?

DarkLight1337 avatar Oct 11 '24 14:10 DarkLight1337

cc @ywang96

DarkLight1337 avatar Oct 11 '24 14:10 DarkLight1337

I think I've run the profiler. At least I followed the documentation here and found the start_profile and stop_profile routes.
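
For anyone else trying this, the two routes can be driven with plain HTTP requests. A minimal sketch (assuming the server was launched with the VLLM_TORCH_PROFILER_DIR environment variable set, as the profiling docs describe, and is listening on localhost:8000):

import requests  # hypothetical helper script, not part of the original repro

BASE_URL = "http://localhost:8000"

requests.post(f"{BASE_URL}/start_profile")   # begin recording a torch profiler trace
# ... run the workload you want captured, e.g. the repro script above ...
requests.post(f"{BASE_URL}/stop_profile")    # flush the trace to VLLM_TORCH_PROFILER_DIR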

Unfortunately I haven't been able to view the logs. I've tried both perfetto and tensorboard. perfetto keeps throwing a segmentation fault when I try to load the log file, and tensorboard was throwing the error `TypeError: MessageToJson() got an unexpected keyword argument 'including_default_value_fields'`. I tried again after installing tensorboard from source and it just says `No dashboards are active for the current data set.`

I don't know if this is something I'm doing wrong, a bug with vLLM, or a bug with the analysis engines. Either way, I can't actually view the traces.

If you want to have a look, I've uploaded the traces here.

SeanIsYoung avatar Oct 14 '24 11:10 SeanIsYoung

I'm able to read the logs, but since the logfile is so huge, it takes quite a while to load, and it outright crashes when I try to view the details from the stack trace. After installing via `pip install torch_tb_profiler`, I ran the command `tensorboard --logdir=<directory_containing_the_json_file>`.

DarkLight1337 avatar Oct 14 '24 13:10 DarkLight1337

That worked, thanks.

You've probably already seen this then, but in case it's useful to anyone else: [profiler screenshot]

Staggered / Baseline: [profiler screenshot]

Async / Experimental: [profiler screenshot]

SeanIsYoung avatar Oct 14 '24 14:10 SeanIsYoung

We also need the stack trace info, since those graphs only keep track of PyTorch operations. I suspect the overhead is in the vLLM engine itself.

DarkLight1337 avatar Oct 14 '24 14:10 DarkLight1337

Assuming I've done the right thing using VLLM_TRACE_FUNCTION=1, the stack traces have been uploaded as well, though I'm afraid I have no idea how to pull anything useful from them.

I've also had a play around with v0.6.3 and the problem is still largely present (with both openbmb/MiniCPM-V-2_6 and Qwen/Qwen2-VL-7B-Instruct). However, this version seems to have lower memory usage (at least in this use case), and under the original conditions with 3 async calls the slowdown is no longer present, but after increasing the number of async calls it eventually returns. Looking at the memory usage, it appears that the slowdown only shows up when vLLM starts making use of the shared GPU memory (system RAM pretending it's VRAM). But as before, the curious thing is that staggering the calls uses significantly less memory, to the point that I actually run out of KV cache before it can start using the shared memory.

SeanIsYoung avatar Oct 15 '24 13:10 SeanIsYoung

Also cc @youkaichao regarding this, I don't think this has anything to do with broadcasting though since TP/PP isn't being used here.

DarkLight1337 avatar Oct 15 '24 14:10 DarkLight1337

> Assuming I've done the right thing using VLLM_TRACE_FUNCTION=1, the stack traces have been uploaded as well, though I'm afraid I have no idea how to pull anything useful from them.
>
> I've also had a play around with v0.6.3 and the problem is still largely present (with both openbmb/MiniCPM-V-2_6 and Qwen/Qwen2-VL-7B-Instruct). However, this version seems to have lower memory usage (at least in this use case), and under the original conditions with 3 async calls the slowdown is no longer present, but after increasing the number of async calls it eventually returns. Looking at the memory usage, it appears that the slowdown only shows up when vLLM starts making use of the shared GPU memory (system RAM pretending it's VRAM). But as before, the curious thing is that staggering the calls uses significantly less memory, to the point that I actually run out of KV cache before it can start using the shared memory.

~~Sorry, I meant the trace information in the original logs via start_profile and stop_profile. Let me try extracting a subset of the logs and see if they can load on TensorBoard. (Seems that the "Trace" view is empty, I'm also quite new to this so not sure why this is the case...)~~ Seems that even the trace information only includes PyTorch op/module calls, perhaps we should run a regular Python profiler instead.

DarkLight1337 avatar Oct 15 '24 15:10 DarkLight1337

Had a go using py-spy; I think that might be more what you were after. I've uploaded the results with the rest. The actual SVG files are interactive, so they provide a bit more information than what's shown below, but the speedscope files are probably more informative.

To look at the speedscope files you'll either need to install speedscope or use the web app.

staggered async

SeanIsYoung avatar Oct 16 '24 09:10 SeanIsYoung

Thanks for providing info! This shows that sampling is taking much longer in the async case. Perhaps there is an issue in the multi-step scheduler? @comaniac @afeldman-nm are you available to look into this?

DarkLight1337 avatar Oct 16 '24 11:10 DarkLight1337

@SeanIsYoung do you get a similar issue if you don't input any multimodal data? Our multimodal models should accept text-only input so you just have to leave out the images.

DarkLight1337 avatar Oct 16 '24 11:10 DarkLight1337

@DarkLight1337 Using the same model and text-only input, I can't get it to do this. The async and staggered versions both appear to have similar memory usage.

I'm also unable to push it to the point that the GPU starts using its shared memory; the KV cache runs out of space first, limiting the number of active calls, and I have that set as high as it will allow on my system.

SeanIsYoung avatar Oct 16 '24 12:10 SeanIsYoung

Based on the command, I don't think multi-step scheduling is enabled, and AFAIK the async output processor is disabled when enforcing eager mode. The huge sampling time in the profile may not be accurate, because CUDA kernels are launched in a non-blocking fashion and the sampler has a sync point (i.e. a barrier).
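
To illustrate the sync-point effect outside vLLM (a hedged, self-contained sketch, not vLLM code): the matmul below returns almost immediately because the kernel is queued asynchronously, and the wall-clock cost only shows up at the synchronize call, which is why a profiler can appear to charge all the time to whatever op synchronizes first, such as the sampler.

import time
import torch

x = torch.randn(4096, 4096, device="cuda")

start = time.perf_counter()
y = x @ x                   # kernel is launched asynchronously; control returns right away
launched = time.perf_counter()
torch.cuda.synchronize()    # the real wait happens here, at the sync point
done = time.perf_counter()

print(f"launch returned after {launched - start:.4f}s, sync took {done - launched:.4f}s")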

For "3 manually started synchronous instances", did you launch 3 endpoints and adjust the GPU memory utilization to about 30% per instance, or you have 3 GPUs?

In the case where the staggered version works, my impression is that the vLLM request queue is not that efficient (probably due to the single-threaded asyncio event loop). So I implemented another queue outside vLLM to prevent my script from sending too many requests at once.
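
For reference, a sketch of that kind of external throttle (the async OpenAI client here is illustrative; the base URL and served model name are taken from the serving command above, and the concurrency limit is arbitrary):

import asyncio

from openai import AsyncOpenAI  # any async client works; this one is just an example

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
limiter = asyncio.Semaphore(3)  # cap in-flight requests outside vLLM

async def ask(prompt: str) -> str:
    async with limiter:         # extra requests wait here instead of piling into the engine
        resp = await client.chat.completions.create(
            model="hello",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content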

comaniac avatar Oct 18 '24 19:10 comaniac

For "3 manually started synchronous instances", did you launch 3 endpoints and adjust the GPU memory utilization to about 30% per instance, or you have 3 GPUs?

No, sorry if I didn't make that clear. All of them were run with only a single instance of the vLLM server running.

What I meant here was that I had a script that would iterate through a directory, making calls to vLLM one at a time for each image (the script itself having no async or anything, just a for loop). I then opened another two terminal windows and started two more instances of the script, pointed at different directories.

SeanIsYoung avatar Oct 18 '24 19:10 SeanIsYoung

I see, so the 3 sync instances are actually 3 processes sending requests, each sending them sequentially. A more common term for this use case is "concurrency". Then it's hard for me to compare your async and sync setups, because from the engine's point of view the only difference is "QPS" (queries per second). I suspect your issue comes from the overhead caused by too high a QPS. One thing you can try:

  1. Launch a server. Have 3 processes sending requests sequentially. Monitor the server log and observe Running: xx reqs and Pending: yy reqs.
  2. Re-launch the server. Have 3 processes sending requests asynchronously. Monitor the server log and observe Running: aa reqs and Pending: bb reqs.

Then you can compare these values over time, where the number of running requests is the batch size, and the number of pending requests is the queue length. Basically we need to know which number changes significantly when moving from sync to async.
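
As an alternative to eyeballing the log, the same two numbers can be scraped from the Prometheus endpoint. A rough sketch, assuming the default /metrics route and the 0.6.x metric names (vllm:num_requests_running and vllm:num_requests_waiting):

import re
import time

import requests  # sketch only

METRICS_URL = "http://localhost:8000/metrics"
PATTERN = re.compile(r"^(vllm:num_requests_(?:running|waiting))\{[^}]*\}\s+([0-9.]+)", re.M)

while True:
    body = requests.get(METRICS_URL).text
    counts = {name: float(value) for name, value in PATTERN.findall(body)}
    print(counts)   # e.g. {'vllm:num_requests_running': 3.0, 'vllm:num_requests_waiting': 0.0}
    time.sleep(1)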

comaniac avatar Oct 18 '24 20:10 comaniac

Yeah, you're right, in the end there's not much difference between the 3 sync, staggered async and fully async setups. As far as the server is concerned, it still ends up running 3 requests concurrently; it's just a question of whether it is handed the 3 requests all at the same time or staggered (as an aside, the 3 sync case is essentially the same as the staggered one, it's just that the delay is the time it takes me to swap to a different terminal and start the next instance).

Consequently, the only difference in the running group is that you might see it at 1 or 2 initially when the calls are staggered, but it will still eventually end up running the 3 requests concurrently; they just won't have started at the same time. From what I've seen, the only time anything is put into the pending queue is when the KV cache is full; otherwise requests go straight to the running group.

I don't know enough about how vLLM works under the hood to be certain, but I'm guessing a couple of things are combining here. Firstly, when I send the requests asynchronously / all at the same time, I'm guessing that all the preprocessing is done at once, using the same shared resource space, and because it's done at the same time this resource space grows larger than when the requests are staggered. This guess comes from the fact that, for the same number of requests, the VRAM usage is higher when the calls are completely async than when they are staggered, and that in speedscope it looks as though all the preprocessing is done in one go in the async case, whereas the staggered case seems to have a separate preprocessing call for each request (this could be wrong due to monitoring an async process, but I don't know).

Staggered: [speedscope screenshot]

Async: [speedscope screenshot]

Having run it with 0.6.3, where memory usage is a bit lower, I saw that I could run the calls completely asynchronously with no slowdown, so long as there were few enough requests to keep memory requirements within VRAM. But when those requests push the memory usage into shared memory, it starts to slow down (which seems reasonable / plausible, given that system RAM is slower and may require copying data between it and the GPU).

But when you go back to making fewer requests, the slowdown is still present, indicating that either some flag has been set, or that now that the cache has expanded into shared memory, it won't shrink back out of it.

But I could also be entirely wrong.
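
One way to check the shared-memory theory would be to log device memory while the requests run. A minimal sketch using nvidia-ml-py (already listed in the environment above); NVML only reports dedicated VRAM, so if usage stays pinned at the card's capacity while throughput drops, the overflow is presumably spilling into shared (system) memory:

import time

import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

while True:
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"used {info.used / 2**30:.2f} GiB of {info.total / 2**30:.2f} GiB")
    time.sleep(1)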

SeanIsYoung avatar Oct 18 '24 22:10 SeanIsYoung

So your "sync" is not really "sync"...it's really confusing.

Then what I can think of, in summary, is that batch size 3 has lower throughput than batch size 1: when you send requests with some delay (what you call "staggered"), vLLM prefills (and encodes images for) one request at a time, while when you send them all at once (what you call "async"), vLLM processes them together (batch size = 3). If your requests have a long prompt length, this may be the case. For text models we solved this problem with chunked prefill, but it is not merged yet for MM models.

comaniac avatar Oct 18 '24 22:10 comaniac

So your "sync" is not really "sync"...it's really confusing.

Sorry about that, I was talking in terms of the individual scripts.

> Then what I can think of, in summary, is that batch size 3 has lower throughput than batch size 1: when you send requests with some delay (what you call "staggered"), vLLM prefills (and encodes images for) one request at a time, while when you send them all at once (what you call "async"), vLLM processes them together (batch size = 3). If your requests have a long prompt length, this may be the case. For text models we solved this problem with chunked prefill, but it is not merged yet for MM models.

I think this makes sense.

SeanIsYoung avatar Oct 19 '24 15:10 SeanIsYoung

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jan 18 '25 01:01 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Feb 17 '25 02:02 github-actions[bot]