sglang [Feature] Allow arbitrary logit processors

Motivation

There are some great projects out there that modify logits, mostly for guided decoding or novel sampling techniques. Supporting every one of them individually would cause too much bloat and distraction, but if SGLang allowed arbitrary logit processors, the community could plug in their own.

For example, I would be interested in using [lm format enforcer](https://github.com/noamgat/lm-format-enforcer) because it allows optional JSON fields and recursive classes (unlike outlines). Its API is also clean and simple, and it is easy to write custom parsers for formats other than JSON (e.g. SQL).

One way I would imagine the API to work is:

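import sglang as sgl
import torch


def my_logits_processor(inputs: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Takes the input token ids and the raw logits; returns the modified logits.
    ...


@sgl.function
def character_gen(s, name):
    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
    # `logits_processor` is the proposed keyword argument.
    s += sgl.gen("output", logits_processor=my_logits_processor)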
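
To make the proposed contract concrete, here is a minimal sketch of a processor with the same `(inputs, logits) -> logits` shape as the example above. Plain Python lists stand in for `torch.Tensor` so it runs without dependencies, and `make_allowlist_processor` and `greedy_pick` are hypothetical names used for illustration only, not part of SGLang or any other library.

```python
import math
from typing import Callable, Sequence

# Assumed shape of a processor: (token ids so far, raw logits) -> new logits.
LogitsProcessor = Callable[[list[int], list[float]], list[float]]


def make_allowlist_processor(allowed: Sequence[int]) -> LogitsProcessor:
    """Build a processor that only permits the token ids in `allowed`."""
    allowed_set = set(allowed)

    def processor(inputs: list[int], logits: list[float]) -> list[float]:
        # `inputs` is unused in this toy rule; a grammar-based processor
        # would use it to decide which tokens are legal next.
        return [
            score if token_id in allowed_set else -math.inf
            for token_id, score in enumerate(logits)
        ]

    return processor


def greedy_pick(logits: list[float]) -> int:
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


processor = make_allowlist_processor([1, 3])
raw_logits = [2.5, 0.1, 4.0, 1.7]
masked = processor([], raw_logits)
next_token = greedy_pick(masked)
```

Without the processor, greedy decoding would pick token 2. With it, tokens 0 and 2 are masked to negative infinity, so the pick falls to the best allowed token.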

I'm not familiar with the internals of SGLang at all, so I am just throwing out an idea: supporting an async logits processor. Often we only care about logit masks, which can be calculated before the scores are known. This would be more efficient because the CPU can calculate the mask while the GPU runs the model. Right now, the lack of such an implementation makes logit processors a performance bottleneck in vLLM.

An async logit processor could simply work like this:

async def my_logits_processor(inputs: list[int]) -> AsyncGenerator[torch.Tensor, torch.Tensor]:
    # All the preprocessing steps here to calculate the mask in parallel
    mask = ...

    # Suspend until the runtime sends in the logits for this step
    logits: torch.Tensor = yield

    # Apply the mask to the logits here to calculate the new logits
    new_logits = ...
    yield new_logits

Of course, the async approach would only work if the model's calculations and the logits processor do not run in the same Python process. I'm not sure whether that is the case in SGLang's server implementation.
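
As a rough sketch of how a runtime might drive such an async generator, the toy below starts a simulated forward pass, lets the processor compute its mask in the meantime, and then sends the logits in. Everything here is an assumption for illustration: `compute_mask`, `fake_forward_pass`, and `decode_step` are hypothetical names, plain lists stand in for `torch.Tensor`, and the forward pass is just an `asyncio.sleep`. It does not settle the same-process question above, since the mask runs in a worker thread of the same process and real overlap would depend on how the model code releases the interpreter.

```python
import asyncio
import math
from typing import AsyncGenerator, Optional


def compute_mask(inputs: list[int], vocab_size: int) -> list[bool]:
    # Toy rule: forbid repeating the previous token. Needs no logits.
    last = inputs[-1] if inputs else None
    return [token_id != last for token_id in range(vocab_size)]


async def masking_processor(
    inputs: list[int], vocab_size: int
) -> AsyncGenerator[Optional[list[float]], list[float]]:
    # Phase 1: build the mask from the token ids alone, off the event loop.
    mask = await asyncio.to_thread(compute_mask, inputs, vocab_size)
    # Phase 2: suspend until the driver sends in the logits.
    logits = yield None
    # Phase 3: apply the mask and hand back the new logits.
    yield [s if ok else -math.inf for s, ok in zip(logits, mask)]


async def fake_forward_pass(vocab_size: int) -> list[float]:
    await asyncio.sleep(0.01)  # stands in for the GPU running the model
    return [float(i) for i in range(vocab_size)]


async def decode_step(inputs: list[int], vocab_size: int) -> list[float]:
    gen = masking_processor(inputs, vocab_size)
    model_task = asyncio.create_task(fake_forward_pass(vocab_size))
    await gen.asend(None)  # runs phase 1 while model_task is in flight
    logits = await model_task
    new_logits = await gen.asend(logits)  # runs phase 3
    await gen.aclose()
    return new_logits


result = asyncio.run(decode_step([0, 3], 4))
```

The first `asend(None)` advances the generator to its bare `yield`, which is where the mask computation overlaps with the pending forward pass. The second `asend` delivers the logits and returns the masked result.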

The added benefit of integrating it in SGLang over other inference systems is the ability to easily enable logits processors for only certain sections of the generated output.

Related resources

No response

Aug 11 '24 19:08 iiLaurens

Second this. This would allow constrained decoding library developers (like me lmao) to provide integration for sglang.

Aug 21 '24 23:08 Dan-wanna-M

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

Oct 21 '24 01:10 github-actions[bot]