Simon Mo
Adding @njhill who initially added this.
Here's my recommendation:
* We can sort and add the GitHub organization prefix so we have fully qualified names such as `vllm-project/production-stack`, `vllm-project/aibrix`, `kubernetes-sigs/lws`, etc.
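As a minimal sketch of what sorting the fully qualified names would look like (the repository list below just reuses the examples above and is illustrative):

```python
# Illustrative list of ecosystem projects with GitHub org prefixes.
repos = [
    "vllm-project/production-stack",
    "kubernetes-sigs/lws",
    "vllm-project/aibrix",
]

# Sorting the fully qualified names naturally groups entries by organization.
sorted_repos = sorted(repos)
print(sorted_repos)
# → ['kubernetes-sigs/lws', 'vllm-project/aibrix', 'vllm-project/production-stack']
```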
The error is saying that the available memory cannot accommodate such a large context.
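As a rough back-of-the-envelope check of why a long context can exhaust memory, one can estimate the KV-cache footprint per sequence. The model dimensions below are hypothetical placeholders, not taken from the issue:

```python
# Rough KV-cache size estimate per sequence:
# 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * context_length.
# All dimensions below are hypothetical examples.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2        # bf16
context_len = 131_072     # a "large context"

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_len
print(f"{kv_bytes / 2**30:.0f} GiB per sequence")  # → 16 GiB for these dims
```

Even a single sequence at this context length needs tens of GiB of KV cache on top of the model weights, which is why the engine refuses to start.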
We just merged EPLB from @abmfy (cc @WoosukKwon). Please rebase; we would love to expose these core metrics!
I will release a new version once this is fixed. @dtrifiro I think this is because I hard-coded the version in `setup.py`, so no `_version.py` is generated.
I will make a patch release once https://github.com/vllm-project/vllm/pull/9375 merged
Because the weights themselves aren't fp8 quantized yet, Scout can only run with bf16 weights. However, we do support dynamic quantization of the KV cache via `--kv-cache-dtype fp8`. Team from...
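A minimal launch sketch with the flag mentioned above (the model name is an illustrative placeholder; substitute the actual Scout checkpoint):

```shell
# Serve with bf16 weights but an fp8-quantized KV cache.
# The model name below is a placeholder; --kv-cache-dtype fp8 is the flag from above.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --kv-cache-dtype fp8
```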
Test failed https://buildkite.com/vllm/ci/builds/14971/canvas?sid=01957347-78b6-407a-921f-c7a82847a7ed#01957347-7a52-4f3d-972b-78decfdc6577/206-12701
> an existing API with a batch request like you do with the OpenAI Batch API.

@w013nad (or others), please feel free to open an RFC for this to discuss...
@sylviayangyy @zeroorhero thank you for your interest! Yes, @KuntaiDu has created a #feat-kvcache-offloading channel to discuss that.