
Async vllm

Open clefourrier opened this issue 8 months ago • 12 comments

Adds the option to use the new AsyncLLM engine from vLLM v1. It supports DP + PP/TP, but not setting the batch size, and deploys an independent async vLLM model which manages requests on its own through the async engine.

Thanks to the kind people at vllm (https://github.com/vllm-project/vllm/issues/17385), I realised we actually have to use a single event loop for async models, so that is what this PR does - it's also very fast now.
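
For context, a minimal sketch of the single-event-loop pattern (the names here are illustrative, not lighteval's actual code): every coroutine is submitted to one shared loop, rather than creating a fresh loop per request, which would cut the engine's background tasks off from the requests that await them.

```python
import asyncio

class SingleLoopRunner:
    """Funnel all async requests through one shared event loop (sketch)."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()

    def run(self, coro):
        # Reuse the same loop for every call; repeated asyncio.run()
        # would create and destroy a new loop each time instead.
        return self._loop.run_until_complete(coro)

async def fake_request(x):
    # Hypothetical stand-in for an async engine call.
    await asyncio.sleep(0)
    return x * 2

runner = SingleLoopRunner()
results = [runner.run(fake_request(i)) for i in range(3)]
print(results)  # [0, 2, 4]
```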

clefourrier avatar Apr 28 '25 14:04 clefourrier

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Linked to #670

clefourrier avatar Apr 28 '25 17:04 clefourrier

Awesome!

ScottHoang avatar Apr 28 '25 19:04 ScottHoang

Or @lewtun if you want to take a look?

clefourrier avatar Apr 29 '25 12:04 clefourrier

Btw, interestingly, I suspect that our other current async model calls (like the TGI ones) only worked because we were only ever running one type of async loop - we might want to unify them in a later PR.

clefourrier avatar Apr 29 '25 17:04 clefourrier

Re batch size = 1: a larger batch size is simply not supported yet in the generate method of the AsyncLLM model (unless I'm reading this comment wrong ^^)

clefourrier avatar Apr 30 '25 07:04 clefourrier

> Re batch size = 1: a larger batch size is simply not supported yet in the generate method of the AsyncLLM model (unless I'm reading this comment wrong ^^)

Ah I see, so in the async version one cannot pass a list of prompts. It would be interesting to benchmark one of the pass@1 evals like AIME24 with a DeepSeek-Distill model to get a sense of how much of a speed difference this makes (mostly asking to see if we should adopt it in open-r1).
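
A rough sketch of what the per-prompt submission looks like in practice - `generate_one` below is a hypothetical stand-in for the engine call, not vLLM's real API. Since the engine accepts one prompt per request, a "batch" becomes many concurrent single-prompt requests gathered together:

```python
import asyncio

async def generate_one(prompt: str) -> str:
    # Hypothetical stand-in for a single-prompt AsyncLLM request.
    await asyncio.sleep(0)
    return f"completion for: {prompt}"

async def generate_all(prompts: list[str]) -> list[str]:
    # No list-of-prompts API: fan out one request per prompt and
    # let the event loop run them concurrently.
    return await asyncio.gather(*(generate_one(p) for p in prompts))

outputs = asyncio.run(generate_all(["a", "b", "c"]))
print(outputs)
```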

lewtun avatar Apr 30 '25 08:04 lewtun

On it, so DP2 with ray vs with async?

clefourrier avatar Apr 30 '25 09:04 clefourrier

Ok so I found a fun thing - I assumed that the .generate method actually sent back the full generation but... it does not - you need to iterate on it to get the generation token by token. (I was getting 0 tokens and empty text on my generative evals all day.) So do not merge, I still need to do some timing estimates now ^^"
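
A minimal sketch of the pitfall, assuming (as with streaming engines generally) that each yielded item is a cumulative partial output - `streaming_generate` is a hypothetical stand-in, not vLLM's real API:

```python
import asyncio

async def streaming_generate(prompt: str):
    # Mimics a streaming engine: yields cumulative partial
    # outputs, one token at a time (hypothetical stand-in).
    text = ""
    for token in ["Hello", ",", " world"]:
        text += token
        yield text

async def collect_final(prompt: str) -> str:
    final = ""
    # The call returns an async generator, not the finished text:
    # iterate it to completion; the last item holds the full generation.
    async for partial in streaming_generate(prompt):
        final = partial
    return final

print(asyncio.run(collect_final("hi")))  # Hello, world
```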

clefourrier avatar Apr 30 '25 16:04 clefourrier

There's an issue with the pass@k metrics that I need to investigate, as they are failing.

clefourrier avatar Apr 30 '25 18:04 clefourrier

> On it, so DP2 with ray vs with async?

That would be a good test! Even better would be DP=8 if you can get a free node :)

lewtun avatar May 01 '25 07:05 lewtun

Spent my day on this; I suspect I'm missing something extremely trivial in either how the sampling is done in the async vLLM or how I should gather results - I need to start working on something else, so feel free to take a look if it's urgent.

clefourrier avatar May 05 '25 16:05 clefourrier

@NathanHB going to merge to avoid diverging from main too much, though there might be issues with sampling evals (added a warning for now, and created an issue to investigate).

clefourrier avatar May 22 '25 10:05 clefourrier