
[V1][Performance] Implement custom serialization for MultiModalKwargs [Rebased]

Open p88h opened this issue 8 months ago • 9 comments

FIX #16185 (link existing issues this PR will resolve)

This is a rebase of #16279, whose commits were too entangled. Implements additional handling of MultiModalKwargs on top of #13790. Further improves memory usage by another 50% on top of the improvements in #16273.

p88h avatar Apr 10 '25 21:04 p88h

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Apr 10 '25 21:04 github-actions[bot]

@p88h This is amazing! Have you tried running some benchmarks to see the throughput performance impact of this PR?

ywang96 avatar Apr 10 '25 21:04 ywang96

@ywang96 I've added a benchmark table to the linked bug #16185

My benchmark focused on memory usage rather than throughput, and only used a single model. The change should not affect throughput much, though, except in cases that actually run into memory issues.

I'll try running some throughput checks tomorrow

p88h avatar Apr 10 '25 23:04 p88h

Also, I haven't looked closely at the entire flow, but in the case of MMKs created from items, it might make sense to defer the population of their data (via the "reduce" operations), since that work is repeated in the receiving process and causes extra CPU and memory overhead (tensors may get stacked, etc.). It would be nice if there were a way for this to happen lazily, but I guess that depends on how the data is later accessed.

FYI I've opened another PR to help with this: https://github.com/vllm-project/vllm/pull/16440. It should in theory help all of the cases, not just the multi-proc case.

It would still be additionally beneficial to postpone the reduce operation until after the data has been transferred to the engine, though.
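
To illustrate the idea, the deferred reduction could look roughly like the sketch below; LazyStack and materialize are hypothetical names used only for illustration, not vLLM API, and the point is simply to postpone the stack until first access.

```python
from typing import List, Optional

import torch


class LazyStack:
    """Hypothetical wrapper that defers torch.stack until the combined
    tensor is actually needed, so the receiving process only pays the
    stacking cost (and allocation) if and when the kwarg is consumed."""

    def __init__(self, tensors: List[torch.Tensor]):
        self._tensors = tensors
        self._stacked: Optional[torch.Tensor] = None

    def materialize(self) -> torch.Tensor:
        # Stack (and allocate) lazily, on first access only.
        if self._stacked is None:
            self._stacked = torch.stack(self._tensors)
        return self._stacked


# Per-item tensors would travel over IPC individually; the stack happens
# in the engine process when the combined tensor is first accessed.
items = [torch.zeros(3, 336, 336) for _ in range(4)]
batch = LazyStack(items).materialize()  # shape: (4, 3, 336, 336)
```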

njhill avatar Apr 11 '25 01:04 njhill

I have some experimental data with this PR in place. Interestingly, it performs much better with zero-copy disabled.

In this new benchmark, I am feeding gradually increasing document sets to the engine. It turns out custom serialization helps less than expected - I think previously it was augmented by the cache, but now all files are unique, so the results are a bit different.

The 'mix' case measures running all prompts together (15 prompts, 128 images total) after they have first been processed one by one, so it is expected to perform much better since everything is cached.

config / benchmark case       | 4 images | 8 images | 16 images | 32 images | t.max | t.mix
------------------------------+----------+----------+-----------+-----------+-------+-------
baseline (zero-copy disabled) | 3.55 GB  | 5.11 GB  | 9.96 GB   | 22.54 GB  | 90.4s | 44.1s
baseline (zero-copy enabled)  | 3.50 GB  | 5.01 GB  | 9.87 GB   | 22.56 GB  | 75.3s | 39.4s
#16432 (zero-copy enabled)    | 3.40 GB  | 4.75 GB  | 8.53 GB   | 22.02 GB  | 13.8s | 36.1s
#16432 (zero-copy disabled)   | 3.28 GB  | 3.95 GB  | 4.76 GB   | 5.85 GB   | 14.4s | 36.3s

p88h avatar Apr 11 '25 11:04 p88h

Thanks @p88h, the benchmark scripts look great!

Interestingly, it performs much better with zero-copy disabled

In your table, the first two rows are ordered disabled/enabled and the second two enabled/disabled... just checking that this doesn't mean the numbers are mixed up. It appears that without this PR, zero-copy helps, but in conjunction with this PR it hurts (memory use especially)?

~If it is indeed the case that it hurts the performance, I have an idea what the reason might be and have opened another PR to address that - it doesn't make sense to follow the zero-copy path for non-contiguous tensors since they'll incur an additional copy anyhow: https://github.com/vllm-project/vllm/pull/16492~ edit: still checking this

njhill avatar Apr 11 '25 15:04 njhill

Regarding the table: yes, the row ordering is odd, but the contents are correct. What matters most is the last row, anyway.

Overall I think this is a problem with either zmq itself or the way we are using it: something ends up allocating ~500 MB per async message used to send files, which naturally becomes problematic when sending multiple messages, zero-copy or not.

That number is eerily similar to the buffer size specified in make_zmq_socket. But that sets the RCVBUF/SNDBUF sizes, which control the kernel-level socket buffers. The HWM limits were set to 0 (unbounded). Unfortunately, even setting SNDHWM to a ridiculously low value doesn't seem to affect memory consumption much.
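
For reference, the knobs discussed here correspond to the following pyzmq socket options (a minimal sketch; the 0.5 GiB value just mirrors the buffer size mentioned above and is not a recommendation):

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)

# Kernel-level socket buffers (bytes) - the ~0.5 GiB value referred to
# above. These bound what the OS buffers, not what pyzmq queues in
# user space.
sock.setsockopt(zmq.SNDBUF, 512 * 1024 * 1024)
sock.setsockopt(zmq.RCVBUF, 512 * 1024 * 1024)

# High-water marks are counted in messages, not bytes; 0 means unbounded.
# Even a very low SNDHWM did not noticeably change memory use here.
sock.setsockopt(zmq.SNDHWM, 0)
sock.setsockopt(zmq.RCVHWM, 0)
```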

It's all really suspect, since we're not even sending that much data through those sockets - so something, somewhere, definitely tries to allocate a very large buffer for every sent message. Still not sure what/where/why. Limiting the size of the sent messages seems to help as well; apparently very large messages hit some other issues.

For now, in this PR, the zero-copy threshold is set to an absurdly high value, which makes it work reasonably well (but effectively turns off zero-copy... and has to additionally copy memory on receive to make Tensors happy).
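
In sketch form, the threshold logic amounts to something like the following (encode_tensor is a hypothetical helper, not the PR's actual code): tensors below the threshold are copied inline into the serialized payload, and only larger ones would go out-of-band, so a very high threshold disables the zero-copy path entirely.

```python
from typing import List

import torch

# Hypothetical threshold; set absurdly high to effectively disable the
# zero-copy path, as described above.
ZERO_COPY_THRESHOLD = 256 * 1024 * 1024


def encode_tensor(t: torch.Tensor, frames: List[memoryview]) -> dict:
    buf = t.contiguous().numpy().data  # memoryview over the tensor storage
    meta = {"dtype": str(t.dtype), "shape": tuple(t.shape)}
    if t.nbytes < ZERO_COPY_THRESHOLD:
        # Small tensor: copy its bytes inline into the main payload.
        meta["data"] = bytes(buf)
    else:
        # Large tensor: reference a separate out-of-band frame instead.
        frames.append(buf)
        meta["frame"] = len(frames) - 1
    return meta
```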

On the positive side, performance is significantly improved on large inputs; zero-copy does not seem to actually improve it even when it does fit into memory, and when it doesn't, performance suffers. On small inputs it seems to be just as good as before. I can try replacing ZMQ with something else just to check whether that helps, but that's a topic for a follow-up PR.

p88h avatar Apr 11 '25 18:04 p88h

Thanks @p88h. I have a few questions...

  • This strange per-message usage is happening on the client/encode side?
  • When you say per message do you mean per call to send_multipart in the zero-copy case? Or per call to send? Since send_multipart just makes multiple send calls.
  • Presumably when you say per async message you mean when a bunch of different inputs are sent in a single call to LLM.generate?
  • It looks like you're doing all of the benchmarking via LLM.generate/chat rather than via the openai API server, is that correct?

zero-copy does not seem to actually improve it even when it does fit into memory, and when it doesn't, performance suffers.

What are the fits-into-memory and doesn't-fit-into-memory cases referring to here? And you're saying that the use of zero-copy still makes the memory use much higher and the performance worse, is that correct? I'm still a bit confused about the reason for that.

has to additionally copy memory on receive to make Tensors happy

I have another PR which should avoid the extra copy on receive by ensuring a memoryview is used when decoding: https://github.com/vllm-project/vllm/pull/16492. I don't think it matters really if the backing data for the tensors is read-only.
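
For context, a minimal sketch of the decode-side difference, assuming the payload arrives as a memoryview over a received frame: torch.frombuffer aliases the buffer directly and only warns that it is read-only, while cloning forces the extra copy.

```python
import torch

# Stand-in for a received zmq frame: 16 float32 values as a read-only view.
received = memoryview(bytes(16 * 4))

# Zero-copy decode: the tensor aliases the received buffer directly.
# PyTorch emits a UserWarning because the buffer is not writable.
t = torch.frombuffer(received, dtype=torch.float32)

# Copying decode: one extra allocation, but the result is writable and
# independent of the transport buffer's lifetime.
t_writable = torch.frombuffer(received, dtype=torch.float32).clone()
```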

njhill avatar Apr 11 '25 19:04 njhill

Thanks @p88h. I have a few questions...

  • This strange per-message usage is happening on the client/encode side?

The memory usage is on the sender / encode side. The number of messages sent grows primarily with the number of multimodal elements. The elements are sent sequentially, though - in the test I'm always handling 32 images in each phase, but they are split differently. So the memory usage in the '4 images' column is actually 8 prompts with 4 images each, 40 messages total; likewise, the '32 images' case is 1 prompt with 32 images, 33 messages total.

That's what triggers the memory use: multiple messages being sent as one multipart request. I suppose they are sent quickly enough that the receiving end doesn't have a chance to read them, but the amount of memory used is really disproportionate to the amount of data sent.

  • When you say per message do you mean per call to send_multipart in the zero-copy case? Or per call to send? Since send_multipart just makes multiple send calls.

Per message within the largest send_multipart call. Multiple send_multipart calls don't seem to trigger this.

  • Presumably when you say per async message you mean when a bunch of different inputs are sent in a single call to LLM.generate?

See above - sorry for the confusion. Yes, there are multiple async prompts as well, but the problem is triggered by the number of messages within a single prompt.

  • It looks like you're doing all of the benchmarking via LLM.generate/chat rather than via the openai API server, is that correct?

Yes, I'm using the .generate path / offline-inference style.

zero-copy does not seem to actually improve it even when it does fit into memory, and when it doesn't, performance suffers.

What are the fits-into-memory and doesn't-fit-into-memory cases referring to here? And you're saying that the use of zero-copy still makes the memory use much higher and the performance worse, is that correct? I'm still a bit confused about the reason for that.

I mean that as long as there is enough physical memory free, the performance seems to be OK, even though it does consume a lot of RAM. That could be an indication this memory is just allocated and not necessarily used. Once the system starts running low on memory, though, the performance tanks.

has to additionally copy memory on receive to make Tensors happy

I have another PR which should avoid the extra copy on receive by ensuring a memoryview is used when decoding: #16492. I don't think it matters really if the backing data for the tensors is read-only.

Well, Torch likely complains about this for some reason. If the tensors are processed later / stacked or sliced, then that may actually become an issue.

p88h avatar Apr 11 '25 21:04 p88h

Spent some more time investigating this - it seems that no combination of zmq settings affects the problem.

However, I found the more specific trigger: it's not just that zmq behaves oddly with multiple messages, but specifically when those messages are sent as memoryviews. If I convert them to bytes (which creates a copy, but only on one side, so it's still better than packing the data in with the rest of the message), the problem disappears.

Now, zmq does have some special handling of bytes: when a Frame is constructed on a bytes object, that object is used as the backing buffer for the frame. If you provide a memoryview instead, it can create a copy of the buffer as frame._bytes any time the bytes property is accessed, and one of the things that accesses it is the equality operator. Not sure where that is triggered, but it's the only clue I have right now.

Copying once still seems a better idea than wrapping the data into the msgpack buffer and then potentially having to copy it on the receiving side anyway, and we have to copy non-contiguous tensors regardless, so perhaps this is a good enough compromise for now. Performance-wise it's better than or equal to the previous attempt: it consumes 5.9 GB, but the worst-case times dropped to 13.7s / 35.9s, so it's even faster than before with zero-copy (which used 22 GB).
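
The two send paths being compared look roughly like this (a sketch only; the inproc endpoint, header blob, and tensor are stand-ins, not the actual vLLM wiring - the tensor size just matches the ~16 MB per processed image mentioned below):

```python
import torch
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.bind("inproc://mm-demo")
pull = ctx.socket(zmq.PULL)            # receiver, so the sends don't block
pull.connect("inproc://mm-demo")

header = b"{}"                         # stand-in for the msgpack metadata
tensor = torch.zeros(1024, 1024, 4)    # stand-in for a ~16 MB processed image
view = memoryview(tensor.numpy()).cast("B")  # flat byte view, no copy

# Path 1: zero-copy send of the memoryview (the path that showed the
# disproportionate per-message memory growth described above).
sock.send_multipart([header, view], copy=False)

# Path 2: copy once on the sending side by materializing bytes; the
# receiver gets one contiguous buffer and nothing else needs to be
# copied until the tensor is reconstructed.
sock.send_multipart([header, bytes(view)], copy=True)
```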

p88h avatar Apr 13 '25 11:04 p88h

Benchmarks of the current approach:

config / prompts x images     | 8 x 4        | 4 x 8       | 2 x 16       | 1 x 32        |
------------------------------+--------------+-------------+--------------+---------------+
zero-copy read and write      | 3.4 GB 10.7s | 4.8 GB 9.2s | 8.5 GB 10.5s | 22.0 GB 14.6s |
zero-copy threshold = 256MB   | 3.3 GB 10.7s | 4.0 GB 9.3s | 4.8 GB 10.7s | 5.9 GB  13.8s |
zero-copy read, copy on write | 3.3 GB 10.7s | 4.0 GB 9.2s | 4.8 GB 10.5s | 5.9 GB  13.6s |

Each run has 32 images total, sent in a different layout. Each image is 1 Mpix, and each processed image being sent is 16 MB.

This now seems consistently better than the 'full' zero-copy version, and on par memory-wise with, but faster than, the non-zero-copy version with custom serialization.

p88h avatar Apr 13 '25 13:04 p88h

Thanks @p88h! I am afk most of today but will review/respond towards eod hopefully.

njhill avatar Apr 14 '25 17:04 njhill

Re: zmq & memoryview - I've extracted the communication pattern into a separate test ... and that does not trigger the issue. I tried making the comm pattern as close as possible to the vLLM codebase, and it still just works happily, with the memoryview variant performing a tiny amount better than the copying one. (The small magnitude of the difference is interesting - it seems the copied buffer pages are actually passed directly via kernel space, probably in COW mode. This is suggested by the RSS of the sender plus the RSS of the receiver being significantly higher than the cumulative memory use of the whole test; the only case where that can be true is when they share memory.)

However, that doesn't seem to hold true for memoryview pages - they must get copied over on the receiving end. So one way or another something is copying this memory, meaning the current approach is likely the best possible, for now at least. Using shared memory directly could potentially alleviate that, but it would not scale to DP-over-TCP cases.

Isolated zmq test: https://gist.github.com/p88h/81659b458acc244d7117d2f84f05614c
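
(For reference, the shared-vs-private page accounting above can be checked with psutil along these lines; this is a hypothetical helper for illustration and not part of the gist.)

```python
import os

import psutil


def report_memory(label: str) -> None:
    """Print RSS alongside USS/PSS. RSS counts shared pages in full for
    every process mapping them, so sender RSS + receiver RSS can exceed
    the real combined usage only when the two processes share memory."""
    proc = psutil.Process(os.getpid())
    info = proc.memory_full_info()  # on Linux this includes uss and pss
    print(f"{label}: rss={info.rss} uss={info.uss} pss={info.pss}")
```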

p88h avatar Apr 15 '25 08:04 p88h

Current results. I switched to async processing in the benchmark to improve throughput a bit, so I had to re-run everything for consistency (also, the asdict() removal helps the in-line case as well).

config / prompts*imgs | 8 x 4        | 4 x 8        | 2 x 16       | 1 x 32        |
----------------------+--------------+--------------+--------------+---------------+
single threaded       | 
main (baseline)       | 3.6 GB 12.1s | 5.2 GB 13.3s | 9.9 GB 24.5s | DNC / OOM     |
#16432 + in-line only | 3.4 GB 11.2s | 4.1 GB 9.3s  | 5.0 GB 10.7s | 6.1 GB  12.2s |
#16432 + zero-copy    | 3.3 GB 10.7s | 4.0 GB 9.1s  | 4.8 GB 10.0s | 5.9 GB  12.0s |

p88h avatar Apr 15 '25 20:04 p88h

Looks like a CI test is failing - but unfortunately the root cause is obscured (the OOM failure of the subsequent test is a result of improper cleanup after the original failure). This should hopefully be addressed by https://github.com/vllm-project/vllm/pull/11737.

In the meantime I can try running this test locally.

P.S. There's no need to keep rebasing on the latest main; it just causes all the tests to start over.

njhill avatar Apr 16 '25 22:04 njhill

It turns out it was because sometimes MMKwargs can contain non-tensor data (specifically "second_per_grid_ts": [1.0] in this case). So I pushed an update to allow floats and ints too.
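
In sketch form, the added handling boils down to a type check along these lines (illustrative only, not the actual encoder code in the PR):

```python
import torch


def encode_value(value):
    """Hypothetical per-entry encoder for MultiModalKwargs values."""
    if isinstance(value, torch.Tensor):
        # Tensors take the custom serialization path.
        return {"kind": "tensor",
                "dtype": str(value.dtype),
                "shape": tuple(value.shape),
                "data": bytes(value.contiguous().numpy().data)}
    if isinstance(value, (list, tuple)):
        # e.g. "second_per_grid_ts": [1.0]
        return [encode_value(v) for v in value]
    if value is None or isinstance(value, (bool, int, float, str)):
        # Plain scalars can be passed straight through.
        return value
    raise TypeError(f"unsupported MultiModalKwargs value: {type(value)!r}")
```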

njhill avatar Apr 16 '25 23:04 njhill

Thank you! I was about to go back to debugging this morning ;)

p88h avatar Apr 17 '25 06:04 p88h