Update the integration of `outlines-core` in vLLM
The integration of `outlines-core` in vLLM currently performs much worse than it could. We can improve the performance in two ways:
- Bypass `outlines` and use the latest version of `outlines-core`, which shows better compilation performance.
- The logits masking is currently very inefficient; it allocates new memory for the mask and the allowed tokens at every step.
If those changes don't bring the speed on par with xgrammar, we need to understand exactly why.
I profiled the logits-processing code, and the bottleneck is transferring the allowed-token-ids list (which can have many elements) to the GPU. My suggestion is to use a compressed representation of the list that can be efficiently uncompressed/used to mask logits on the GPU, for instance a bitmap.
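A minimal sketch of the idea, in pure Python for clarity (function names are illustrative, not vLLM's API): pack the allowed-token set into one bit per vocabulary entry, so transferring the mask costs `vocab_size / 8` bytes regardless of how many tokens are allowed, then unpack it to mask logits.

```python
def build_bitmap(allowed_ids, vocab_size):
    """Pack the allowed-token set into (vocab_size + 31) // 32 32-bit words."""
    words = [0] * ((vocab_size + 31) // 32)
    for tid in allowed_ids:
        words[tid >> 5] |= 1 << (tid & 31)
    return words

def mask_logits(logits, bitmap):
    """CPU reference version: set logits of disallowed tokens to -inf."""
    neg_inf = float("-inf")
    return [
        x if (bitmap[i >> 5] >> (i & 31)) & 1 else neg_inf
        for i, x in enumerate(logits)
    ]

# Toy example: vocab of 8 tokens, only tokens 1 and 5 allowed.
bitmap = build_bitmap([1, 5], 8)
masked = mask_logits([0.1] * 8, bitmap)
```

On GPU the unpack-and-mask step would be a vectorized shift/AND over the logits tensor rather than a Python loop, but the packing scheme is the same.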
We should first evaluate the potential speed-ups in Python. If the bottleneck becomes bitmap construction, we can move it to Rust; if it is the operations on the GPU, we can implement a CUDA kernel to mask the logits.
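For the "evaluate in Python first" step, a rough micro-benchmark of the CPU-side cost might look like the following (the vocabulary size and allowed set are made up for illustration; `build_bitmap` is the same hypothetical packing routine as above):

```python
import timeit

VOCAB_SIZE = 32_000
allowed = list(range(0, VOCAB_SIZE, 2))  # pretend half the vocab is allowed

def build_bitmap(allowed_ids, vocab_size):
    """Pack the allowed-token set into (vocab_size + 31) // 32 32-bit words."""
    words = [0] * ((vocab_size + 31) // 32)
    for tid in allowed_ids:
        words[tid >> 5] |= 1 << (tid & 31)
    return words

t = timeit.timeit(lambda: build_bitmap(allowed, VOCAB_SIZE), number=100)
print(f"bitmap construction: {t / 100 * 1e3:.3f} ms per call")
```

If this number is already small relative to a decoding step, the remaining cost is on the GPU side and a kernel is the right next experiment.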
Is there a corresponding issue in the vLLM repository? Also, as mentioned here, tracking performance over time would really help with reasoning through these kinds of issues.