[Feature] Allow arbitrary logit processors
Motivation
There are some great projects out there that modify logits, mostly for guided decoding or novel sampling techniques. Supporting every single one of them would cause too much bloat and distraction, but if SGLang allowed arbitrary logits processors, the community could plug and play their own.
For example, I would be interested in using [lm format enforcer](https://github.com/noamgat/lm-format-enforcer) because it allows for optional JSON fields and recursive classes (unlike outlines). Its API is also clean and simple, and it is easy to write custom parsers for formats other than JSON (e.g. SQL).
One way I imagine the API working:
import torch
import sglang as sgl

def my_logits_processor(inputs: list[int], logits: torch.Tensor) -> torch.Tensor:
    ...

@sgl.function
def character_gen(s, name):
    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
    # logits_processor is the proposed parameter, not an existing one
    s += sgl.gen("output", logits_processor=my_logits_processor)
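To make the proposed signature concrete, here is a minimal sketch of what a processor body could look like. It is only an illustration: it uses plain Python lists in place of `torch.Tensor` so it runs without dependencies, and `ALLOWED_TOKEN_IDS` and `allow_only_processor` are hypothetical names, not part of SGLang or any library.

```python
import math

# Hypothetical constraint: only these token ids may be generated.
ALLOWED_TOKEN_IDS = {2, 5}

def allow_only_processor(inputs: list[int], logits: list[float]) -> list[float]:
    """Keep the scores of allowed tokens and set every other score to -inf."""
    return [
        score if token_id in ALLOWED_TOKEN_IDS else -math.inf
        for token_id, score in enumerate(logits)
    ]

# inputs is the token history so far; this simple mask happens to ignore it.
masked = allow_only_processor(inputs=[1, 4], logits=[0.1, 0.7, 0.3, 0.9, 0.2, 0.4])

# Greedy pick over the masked scores can only land on an allowed token.
best_token = max(range(len(masked)), key=masked.__getitem__)
```

A guided-decoding library would replace the fixed set with one derived from `inputs`, e.g. the tokens that keep the output valid under a JSON schema.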
I'm not familiar with SGLang's internals at all, so I'm just throwing out another idea: supporting an async logits processor. Often we only care about logit masks, which can be computed before the scores are known. This would be more efficient, since the CPU can compute the masks while the GPU runs the model. Right now, the lack of such an implementation makes logits processors a performance bottleneck in vLLM.
An async logits processor could simply work like this:
from typing import AsyncGenerator

async def my_logits_processor(inputs: list[int]) -> AsyncGenerator[torch.Tensor | None, torch.Tensor]:
    # All the preprocessing steps here to calculate the mask in parallel
    logits: torch.Tensor = yield
    # Apply the mask to the logits here to calculate the new logits
    yield new_logits
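For illustration, this is roughly how a caller could drive such a generator: advance it to the first `yield` so the mask gets computed, then send in the logits and receive the processed ones. This is a sketch under assumptions, not SGLang code: `banned_token_processor`, `fake_forward`, and `decode_step` are hypothetical, plain lists stand in for tensors, and the two phases run sequentially here, whereas a real server would overlap the first phase with the forward pass.

```python
import asyncio
import math
from typing import AsyncGenerator, Optional

async def banned_token_processor(
    inputs: list[int],
) -> AsyncGenerator[Optional[list[float]], list[float]]:
    # Phase 1: build the mask from the token history alone (no logits needed).
    # Hypothetical rule: never repeat the most recent token.
    banned = {inputs[-1]} if inputs else set()
    # Suspend until the caller sends in the logits.
    logits = yield None
    # Phase 2: apply the precomputed mask to the scores.
    yield [-math.inf if i in banned else s for i, s in enumerate(logits)]

async def fake_forward() -> list[float]:
    # Stand-in for the model's forward pass.
    await asyncio.sleep(0)
    return [0.5, 2.0, 1.0]

async def decode_step(inputs: list[int]) -> list[float]:
    processor = banned_token_processor(inputs)
    await processor.asend(None)                  # run phase 1 up to the first yield
    logits = await fake_forward()                # the model produces the scores
    new_logits = await processor.asend(logits)   # run phase 2 with the scores
    await processor.aclose()
    return new_logits

result = asyncio.run(decode_step([0, 1]))
```

The first `asend(None)` is what makes the split useful: everything before the bare `yield` depends only on `inputs`, so it can be scheduled before the logits exist.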
Of course, the async approach would only work if the model's computation and the logits processor do not run in the same Python process. I'm not sure whether this is the case in SGLang's server implementation.
The added benefit of integrating this into SGLang rather than other inference systems is the ability to easily enable logits processors for only certain sections of the generated output.
Second this. This would allow constrained decoding library developers (like me lmao) to provide integrations for SGLang.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.