A high-throughput and memory-efficient inference and serving engine for LLMs

Results: 2,816 vllm issues, sorted by recently updated

It would be great if you could add support for Falcon models as well! Does vLLM support ONNX models today?

new model

When will the /v1/embeddings API be available? Thank you!

good first issue
feature request
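
Until that endpoint ships, here is a minimal sketch of the request shape the issue is asking for, assuming vLLM mirrors OpenAI's /v1/embeddings contract and a server is running on localhost:8000 (URL, port, and model name are illustrative assumptions):

```python
# Hypothetical sketch: what an OpenAI-compatible /v1/embeddings call might
# look like against a local vLLM server. The endpoint does not exist yet;
# the URL, port, and model name are assumptions for illustration.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "intfloat/e5-large-v2",   # hypothetical embedding model
        "input": "The food was delicious.",
    },
)
resp.raise_for_status()
print(resp.json()["data"][0]["embedding"][:5])  # first few dimensions
```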

As mentioned in the title, [this simple example](https://python.langchain.com/docs/get_started/quickstart#llms) passes a list instead of a str. Raw request: ![image](https://github.com/vllm-project/vllm/assets/47108366/197b6dc3-a5b0-49f5-9568-2739de1fbd93) Error message: `INFO: 127.0.0.1:44226 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error...

good first issue
feature request
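
For reference, a minimal reproduction sketch of the failing request, assuming an OpenAI-compatible vLLM server on localhost:8000 (the model name is a placeholder). LangChain's quickstart sends `prompt` as a list of strings rather than a plain string:

```python
# Minimal reproduction sketch (server URL and model name are assumptions).
# LangChain sends `prompt` as a list of strings, not a plain string.
import requests

payload = {
    "model": "facebook/opt-125m",      # placeholder model
    "prompt": ["Say hello"],           # list-of-str, as LangChain sends it
    "max_tokens": 16,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload)
print(resp.status_code)  # 500 today; the OpenAI API accepts list prompts
```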

Thanks for the repo! I can build it successfully on an H100 machine, but when I run the benchmarks it fails with the error below: ``` FATAL: kernel `fmha_cutlassF_f16_aligned_64x128_rf_sm80` is for...

bug
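
For quick triage, a small sketch that reads the GPU's compute capability: the failing kernel name ends in `sm80` (A100-class), while an H100 reports sm90, so the prebuilt kernels likely don't match the device:

```python
# Diagnostic sketch: confirm the mismatch between the built kernels and
# the GPU. The kernel in the error is compiled for sm80 (A100), while an
# H100 reports compute capability (9, 0).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm{major}{minor}")  # H100 -> sm90, A100 -> sm80
```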

Currently, pip installing our package takes 5-10 minutes because our CUDA kernels are compiled on the user's machine. For better UX, we should include pre-built CUDA binaries in our PyPI...

help wanted
Installation
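
For context on why a source install is slow, here is a simplified sketch of a setup.py that compiles CUDA extensions at install time with PyTorch's build helpers (the extension name and source path are illustrative, not vLLM's actual build configuration). Prebuilt wheels on PyPI would skip this compilation entirely:

```python
# Simplified sketch of install-time CUDA compilation via PyTorch's build
# helpers; this step is what makes a source install take minutes.
# The extension name and source path below are illustrative only.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="vllm",
    ext_modules=[
        CUDAExtension(
            name="vllm.attention_ops",       # illustrative extension name
            sources=["csrc/attention.cu"],   # illustrative source path
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```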

Will vLLM support 4-bit GPTQ models?

feature request

Will support be added for encoder-decoder models like T5 or BART? All of the currently supported models are decoder-only.

new model

Is support for Whisper on the roadmap? Something like https://github.com/ggerganov/whisper.cpp would be great.

new model

Based on the examples, vLLM can launch a server with a single model instance. Can vLLM serve clients using multiple model instances? With multiple model instances, the server will...
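
One common workaround until then, sketched under the assumption of the OpenAI-compatible server: run one independent server process per GPU and spread requests across them client-side (ports and model name are placeholders):

```python
# Sketch of client-side round-robin across several independent vLLM
# servers, one per GPU (ports and model are assumptions). Each server is
# started separately, e.g. one process per CUDA device.
import itertools
import requests

SERVERS = itertools.cycle([
    "http://localhost:8000",   # e.g. CUDA_VISIBLE_DEVICES=0
    "http://localhost:8001",   # e.g. CUDA_VISIBLE_DEVICES=1
])

def generate(prompt: str) -> str:
    base = next(SERVERS)  # pick the next server in rotation
    resp = requests.post(
        f"{base}/v1/completions",
        json={"model": "facebook/opt-125m", "prompt": prompt, "max_tokens": 32},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

print(generate("Hello,"))
```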

How easy or difficult would it be to support LoRA fine-tuned models? Would it need big changes to the vLLM engine, or is it something that can be done at...

feature request
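
Until native support lands, one possible workaround (not vLLM's own mechanism) is to merge the LoRA adapter into its base model with the `peft` library and serve the merged checkpoint as a plain model; the model names and adapter path below are placeholders:

```python
# Workaround sketch (not vLLM's own mechanism): merge a LoRA adapter into
# its base model with PEFT, then serve the merged weights like any model.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
merged = merged.merge_and_unload()        # folds LoRA deltas into base weights

merged.save_pretrained("merged-llama-7b")  # then point vLLM at this directory
AutoTokenizer.from_pretrained("huggyllama/llama-7b").save_pretrained("merged-llama-7b")
```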