Simon Mo
Did you turn on `engine-use-ray`?
Sorry about the issue; we are treating it as high priority. We are in the process of reproducing the bug in different kinds of settings. As posted before, our...
My conservative ETA is EOW (12/3). If you want to help look into it as well, the more help the better! On November 25, 2023, GitHub ***@***.***> wrote: > Any idea how...
OK, I spent some time going down different rabbit holes. The conclusion is as follows: you are seeing undesirable performance because of **vLLM's under-optimized support for AWQ models at the moment**....
> I tested 15 prompts without AWQ quantization, and I still get 0.5-1 second between handling each request. After the requests are handled, it starts processing the requests. Can you...
@jpeig, the LM format enforcer bit is a good hint. Given the low generation throughput, I suspect this performance bug, which they fixed recently: https://github.com/noamgat/lm-format-enforcer/issues/28#issuecomment-1836534937
The same PR works since this is small enough. Also cc @njhill, I think you mentioned a similar issue.
Actually it seems complete given @maxdebayser's comment. I will merge now.
You can specify the devices by using the `CUDA_VISIBLE_DEVICES` environment variable.
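A minimal sketch of what that could look like from Python, assuming you launch each workload as its own process. The helper name `env_with_devices` and the device ids `0, 1` are made up for illustration; the child process here only prints the variable, where a real script would import its CUDA library instead.

```python
import os
import subprocess
import sys


def env_with_devices(device_ids):
    """Return a copy of the current environment restricted to the given GPU ids.

    Hypothetical helper for illustration; not part of any library.
    """
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


# Stand-in for a real script: it only reports which devices it was given.
child_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env_with_devices([0, 1]),  # example ids, adjust to your machine
    capture_output=True,
    text=True,
    check=True,
)
visible = result.stdout.strip()
print(visible)
```

The variable needs to be set before the process initializes CUDA, which is why the sketch passes it through the child's environment rather than changing it after startup.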
Try instantiating them in different scripts?