Simon Mo comments

Results: 313 comments of Simon Mo

API causes slowdown in batch request handling

Did you turn on `engine-use-ray`?

API causes slowdown in batch request handling

Sorry about the issue; we are treating it as a high priority. We are in the process of reproducing the bug in different kinds of settings. As posted before, our...

API causes slowdown in batch request handling

My conservative ETA is EOW (12/3). If you want to help look into it as well, the more help the better! On November 25, 2023, GitHub ***@***.***> wrote: > Any idea how...

API causes slowdown in batch request handling

OK, I spent some time going down different rabbit holes. The conclusion is as follows: you are seeing undesirable performance because of **vLLM's under-optimized support for AWQ models at the moment**....

API causes slowdown in batch request handling

> I tested 15 prompts without AWQ quantization, and I still get 0.5-1 second between handling each request. After the requests are handled, it starts processing the requests. Can you...

API causes slowdown in batch request handling

@jpeig, the LM format enforcer bit is a good hint. Given the low generation throughput, I suspect this performance bug, which they fixed recently: https://github.com/noamgat/lm-format-enforcer/issues/28#issuecomment-1836534937

[Frontend][Core] Update Outlines Integration from `FSM` to `Guide`

Same PR works since this is small enough. Also cc @njhill, I think you mentioned a similar issue.

[Frontend][Core] Update Outlines Integration from `FSM` to `Guide`

Actually it seems complete given @maxdebayser's comment. I will merge now.

Unable to specify GPU usage in VLLM code

You can specify the devices by using the `CUDA_VISIBLE_DEVICES` environment variable.
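A minimal sketch of setting the variable from inside Python, in case exporting it in the shell is not convenient. The helper name `select_gpus` is hypothetical and not part of vLLM; the only firm requirement assumed here is that the variable is set before any CUDA-using library initializes the GPU in this process.

```python
import os


def select_gpus(device_ids):
    """Restrict this process to the given GPU indices (hypothetical helper).

    CUDA reads CUDA_VISIBLE_DEVICES when it initializes, so this must run
    before any library in the process touches the GPU.
    """
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only physical GPUs 0 and 1 to this process.
visible = select_gpus([0, 1])

# Import and construct GPU-using libraries only after this point.
```

Inside the process, the selected devices are renumbered starting from 0, so physical GPU 1 in a `"1,3"` selection shows up as device 0.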

Unable to specify GPU usage in VLLM code

Try instantiating them in different scripts?
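One way to read that suggestion, sketched under assumptions: each instance runs in its own Python process, and each process gets its own `CUDA_VISIBLE_DEVICES` value. `WORKER_CODE` is a hypothetical stand-in for the real per-GPU script; here it only prints the variable so the sketch runs without a GPU.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the real worker script: it only reports which
# devices the child process was told it may use.
WORKER_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def run_on_gpu(gpu_id):
    """Run the worker in a separate process restricted to one GPU index."""
    # Copy the parent environment and override the device list for the child.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    result = subprocess.run(
        [sys.executable, "-c", WORKER_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Each child sees a different device list, independent of the parent.
seen = [run_on_gpu(0), run_on_gpu(1)]
```

In a real setup the `-c` snippet would be replaced by the path to the script that builds the model, and the processes would usually be started concurrently rather than one after another.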