[Performance]: guided generation is very slow in offline mode
Proposal to improve performance
With a single request in online mode I'm getting:
- no guided generation: 300 tok/sec
- outlines: 150 tok/sec (2x slower)
- lm-format-enforcer: 90 tok/sec (~3x slower)
With offline mode I get:
- outlines is about 10-20x slower than no guided generation
- lm-format-enforcer is about 4x faster than outlines (note that it is slower than outlines online)
For online I was using this schema:

```python
json_template = {
    "type": "object",
    "properties": {
        "criteria": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "response": {"type": "string"}
    },
    "required": ["criteria", "response"]
}
```
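For context, the requests behind the online numbers look roughly like this (a sketch, not my exact harness: the model name and prompt are placeholders; `guided_json` and `guided_decoding_backend` are the extra params vLLM's OpenAI-compatible server accepts for guided decoding):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

json_template = {
    "type": "object",
    "properties": {
        "criteria": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "response": {"type": "string"}
    },
    "required": ["criteria", "response"]
}

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "Evaluate the answer and reply as JSON."}],
    max_tokens=256,
    extra_body={
        "guided_json": json_template,
        # per-request backend selection, to compare the two
        "guided_decoding_backend": "outlines",  # or "lm-format-enforcer"
    },
)
print(completion.choices[0].message.content)
```

Dropping `extra_body` entirely gives the unguided baseline.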
For offline I was using an even simpler schema:

```json
{
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 2, "maxLength": 5},
        "age": {"type": "integer"}
    },
    "required": ["name", "age"]
}
```
The huge performance hit in offline mode is very strange for both backends. The 2x slowdown in online mode is pretty bad too, as it's already a big impact; offline mode can actually tolerate 2x with no problem since there is no human in the loop, but 10-20x is completely impractical.
vllm==0.6.0 and outlines==0.0.46