[Performance]: guided generation is very slow in offline mode
Proposal to improve performance
With a single request in online mode I'm getting:
- no guided generation: 300 tok/sec
- outlines: 150 tok/sec (2x slower)
- lm-format-enforcer: 90 tok/sec (~3x slower)
With offline mode I get:
- outlines is about 10-20x slower than no guided generation
- lm-format-enforcer is about 4x faster than outlines (note that it is slower than outlines online)
For online I was using this schema:

```python
json_template = {
    "type": "object",
    "properties": {
        "criteria": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "response": {"type": "string"}
    },
    "required": ["criteria", "response"]
}
```
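For context, the requests behind the online numbers look roughly like this (a sketch, not my exact harness: the model name and prompt are placeholders; `guided_json` and `guided_decoding_backend` are the extra params vLLM's OpenAI-compatible server accepts for guided decoding):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

json_template = {
    "type": "object",
    "properties": {
        "criteria": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "response": {"type": "string"}
    },
    "required": ["criteria", "response"]
}

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "Evaluate the answer and reply as JSON."}],
    max_tokens=256,
    extra_body={
        "guided_json": json_template,
        # per-request backend selection, to compare the two
        "guided_decoding_backend": "outlines",  # or "lm-format-enforcer"
    },
)
print(completion.choices[0].message.content)
```

Dropping `extra_body` entirely gives the unguided baseline.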
For offline I was using an even simpler schema:

```json
{
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 2, "maxLength": 5},
        "age": {"type": "integer"}
    },
    "required": ["name", "age"]
}
```
The huge performance hit in offline mode is very strange for both backends. The 2x slowdown in online mode is pretty bad too, as it's already a big impact; offline mode can actually tolerate 2x with no problem since there is no human in the loop, but 10-20x is completely impractical.
vllm==0.6.0 and outlines==0.0.46