vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
It would be great if you could add support for Falcon models as well! Does it support ONNX models today?
When will the /v1/embeddings API be available? Thank you.
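For context, a minimal sketch of the kind of OpenAI-compatible call the question presumably refers to; the endpoint does not exist in vLLM yet, and the port and model name below are assumptions:

```python
import requests

# Hypothetical request shape, mirroring the OpenAI /v1/embeddings API,
# against a vLLM server assumed to be running at localhost:8000.
resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "intfloat/e5-large-v2", "input": "Hello, world!"},
)
print(resp.json())
```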
As mentioned in the title, [this simple example](https://python.langchain.com/docs/get_started/quickstart#llms) passes a list instead of a str. Raw request:  Error message: `INFO: 127.0.0.1:44226 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error...
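A minimal sketch of the kind of request that triggers the 500, assuming the OpenAI-compatible server is running locally on port 8000 and serving a model named `facebook/opt-125m` (the port and model name are assumptions, not from the report):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        # LangChain's OpenAI wrapper sends the prompt as a list of strings,
        # not a single str, which is what appears to trigger the 500 here.
        "prompt": ["Tell me a joke."],
        "max_tokens": 32,
    },
)
print(resp.status_code, resp.text)
```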
Thanks for the repo! I can build the repo successfully on an H100 machine. But when I run the benchmarks, it shows the error below: ``` FATAL: kernel `fmha_cutlassF_f16_aligned_64x128_rf_sm80` is for...
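The kernel name suggests a build targeting sm80, while the H100 reports compute capability sm90. A quick diagnostic sketch, assuming PyTorch is installed, to compare the GPU's capability with the architectures the installed PyTorch build targets:

```python
import torch

# H100 should report compute capability 9.0 (sm90).
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: sm{major}{minor}")

# Architectures the installed PyTorch build was compiled for.
print(f"Architectures in this PyTorch build: {torch.cuda.get_arch_list()}")
```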
Currently, pip installing our package takes 5-10 minutes because our CUDA kernels are compiled on the user machine. For better UX, we should include pre-built CUDA binaries in our PyPI...
Will support be added for encoder-decoder models like T5 or BART? All of the currently supported models are decoder-only.
Is support for Whisper on the roadmap? Something like https://github.com/ggerganov/whisper.cpp would be great.
Based on the examples, vLLM can launch a server with a single model instance. Can vLLM serve clients using multiple model instances? With multiple model instances, the server will...
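A rough workaround sketch, not a vLLM feature: assume two API server processes have been started independently on ports 8000 and 8001, and a thin client round-robins between them. The `/generate` endpoint and its parameters are assumptions about the demo API server:

```python
import itertools
import requests

# Two independently launched vLLM API servers (ports are assumptions).
backends = itertools.cycle(["http://localhost:8000", "http://localhost:8001"])

def generate(prompt: str) -> dict:
    # Send each request to the next backend in turn.
    url = f"{next(backends)}/generate"
    return requests.post(url, json={"prompt": prompt, "max_tokens": 32}).json()

print(generate("Hello, my name is"))
```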
How easy or difficult would it be to support LoRA fine-tuned models? Would it need big changes to the vLLM engine, or is it something that can be done at...