Update the integration of `outlines-core` in vLLM

Open rlouf opened this issue 11 months ago • 2 comments

The integration of outlines-core in vLLM currently performs a lot worse than it could. We can improve the performance in two ways:

  1. Bypass `outlines` and use the latest version of `outlines-core` directly, which shows better compilation performance.
  2. Make the logits masking more efficient; it currently allocates new memory for the mask and the allowed tokens at every step.
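To illustrate the second point, here is a minimal sketch of reusing a single mask buffer across decoding steps instead of allocating a new one each time. It is written in pure Python for clarity and is not the vLLM or `outlines-core` code; the `ReusableMask` class and its method names are hypothetical, and a real integration would do the same thing with a preallocated tensor.

```python
import math


class ReusableMask:
    """Hypothetical helper: keeps one additive mask buffer alive across steps."""

    def __init__(self, vocab_size):
        # Allocated once; every entry starts as "disallowed".
        self.mask = [-math.inf] * vocab_size
        # Token ids that are currently set to 0.0 in the buffer.
        self._open = []

    def apply(self, logits, allowed_token_ids):
        # Undo only the entries touched at the previous step,
        # rather than rebuilding the whole buffer.
        for t in self._open:
            self.mask[t] = -math.inf
        # Small bookkeeping copy; the vocab-sized buffer itself is reused.
        self._open = list(allowed_token_ids)
        for t in self._open:
            self.mask[t] = 0.0
        # Add the mask to the logits in place.
        for i, m in enumerate(self.mask):
            logits[i] += m
        return logits


masker = ReusableMask(vocab_size=5)
step_logits = [1.0, 2.0, 3.0, 4.0, 5.0]
masker.apply(step_logits, allowed_token_ids=[1, 3])
```

The per-step cost of updating the buffer is proportional to the number of allowed tokens at the current and previous step, and no vocabulary-sized allocation happens after construction.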

If those changes don't bring the speed on par with xgrammar, we need to understand exactly why.

rlouf, Jan 24 '25 20:01

I profiled the logits processing code, and the bottleneck is the transfer of the list of allowed token ids (which can have many elements) to the GPU. My suggestion is to use a compressed representation of the list, for instance a bitmap, that can be efficiently decompressed or used directly to mask the logits on the GPU.
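As a rough sketch of the bitmap idea, the allowed token ids can be packed into fixed-width words, one bit per vocabulary entry, so the payload size depends on the vocabulary size rather than on how many tokens are allowed. The code below is pure Python for illustration only; the function names and the 32-bit word width are assumptions, and in practice the unpacking step would run on the GPU.

```python
import math

WORD_BITS = 32  # assumed word width for this sketch


def pack_bitmap(allowed_token_ids, vocab_size):
    """Pack allowed token ids into a list of 32-bit words (1 bit per token)."""
    n_words = (vocab_size + WORD_BITS - 1) // WORD_BITS
    words = [0] * n_words
    for t in allowed_token_ids:
        words[t // WORD_BITS] |= 1 << (t % WORD_BITS)
    return words


def mask_logits_with_bitmap(logits, words):
    """Set the logit of every token whose bit is 0 to -inf, in place."""
    for t in range(len(logits)):
        if not (words[t // WORD_BITS] >> (t % WORD_BITS)) & 1:
            logits[t] = -math.inf
    return logits


bitmap = pack_bitmap([0, 33, 63], vocab_size=70)
masked = mask_logits_with_bitmap([0.5] * 70, bitmap)
```

For a vocabulary of 128,000 tokens this representation takes 4,000 words (about 16 kB) regardless of the number of allowed tokens, whereas a list of 100,000 allowed ids stored as 64-bit integers would take about 800 kB.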

We should first evaluate the potential speed-ups in Python. If the bottleneck becomes the bitmap construction, we could move it to Rust; if it is the operations on the GPU, we can implement a CUDA kernel to mask the logits.

rlouf, Jan 28 '25 09:01

Is there a corresponding issue in the vLLM repository? Also, as mentioned here, tracking performance would really help with reasoning through these kinds of issues, wouldn't it?

yvan-sraka, Jan 28 '25 18:01