Simon Mo
I need to fix this to get CI passing. Got it working now; running some tests and will ask for review.
I fixed the variable names, but am currently facing a weights naming mismatch (Phi renamed some weights). I will skip this test in CI for now and come back to it.
Superseded by #2428
Thanks! This feature is indeed needed, but we are actively evaluating Outlines, as it seems to offer higher performance for serving because it pre-compiles all the logit masks. I'll continue...
Outlines integration has been added, and the general structure is now in place. We welcome PRs that adapt the lm-format-enforcer backend as well.
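For intuition, here is a conceptual sketch (not vLLM's or Outlines' actual code) of why pre-compiled logit masks are fast: the constraint is compiled once into a per-state allowed-token table, so each decoding step is only a lookup. The table contents below are made up for illustration.

```python
# Conceptual sketch of Outlines-style pre-compiled logit masks.
from typing import Dict, List

# Hypothetical pre-compiled table: FSM state -> token IDs allowed in that state.
# In the Outlines approach this is built once from the regex/schema
# before serving starts, not per request.
ALLOWED_TOKENS: Dict[int, List[int]] = {
    0: [12, 45, 98],  # e.g. tokens that can open a JSON object
    1: [7, 13],       # e.g. tokens valid inside a string
}

def mask_logits(logits: List[float], state: int) -> List[float]:
    """Apply the pre-compiled mask: disallowed tokens get -inf."""
    allowed = set(ALLOWED_TOKENS[state])
    return [x if i in allowed else float("-inf") for i, x in enumerate(logits)]

# Per decoding step there is no regex matching or schema walking,
# just a dictionary lookup, which is what makes it fast enough for serving.
print(mask_logits([0.1] * 100, state=0)[:20])
```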
Please let me know once this PR is updated, or a new PR is opened!
> Of course, I can send multiple separate requests, but those are handled sequentially and do not benefit from speed improvements.

This is not correct. vLLM automatically batches in-flight requests....
This is further illustrated here; I hope the explanation is helpful: https://github.com/vllm-project/vllm/issues/1636#issuecomment-1816831493
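For a concrete picture, here is a minimal sketch of issuing requests concurrently against the OpenAI-compatible server so continuous batching can kick in. The endpoint, model name, and prompts are placeholders; this assumes a server started with `python -m vllm.entrypoints.openai.api_server`.

```python
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"  # assumed local server

async def complete(session: aiohttp.ClientSession, prompt: str) -> str:
    payload = {"model": "facebook/opt-125m", "prompt": prompt, "max_tokens": 64}
    async with session.post(URL, json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["text"]

async def main() -> None:
    prompts = [f"Question {i}: tell me a fact." for i in range(8)]
    async with aiohttp.ClientSession() as session:
        # All eight requests are in flight at once; the server batches them
        # together instead of running them one after another.
        results = await asyncio.gather(*(complete(session, p) for p in prompts))
    for r in results:
        print(r)

asyncio.run(main())
```

The key point is that the client must not await each request before sending the next; as long as the requests overlap in time, the engine batches them.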
Ah, one more thing: if you are observing sequential behavior, try the current main branch instead of the released version, or turn on the flag `--engine-use-ray`. In the released version, our AsyncLLMEngine is...
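If you are driving the engine from Python instead of the CLI, here is a minimal sketch of the equivalent setting, assuming the flag maps to `engine_use_ray` on `AsyncEngineArgs` (model name and prompt are placeholders):

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def main() -> None:
    # engine_use_ray=True runs the engine in a separate Ray worker,
    # mirroring the `--engine-use-ray` CLI flag.
    engine_args = AsyncEngineArgs(model="facebook/opt-125m", engine_use_ray=True)
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    # generate() yields a stream of partial RequestOutputs;
    # the last item holds the finished completion.
    final = None
    async for output in engine.generate("Hello, my name is",
                                        SamplingParams(max_tokens=32),
                                        request_id="req-0"):
        final = output
    print(final.outputs[0].text)

asyncio.run(main())
```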
v0.2.2 was released last night. It should include the change. Please try it out and let us know!