vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Would it be possible to add support for GPT-J? Are there any plans for this?
Do you know if support for this is planned? I may be interested in writing the custom Metal kernels.
On Ubuntu 20.04 with Python 3.10 and pip 23.1.2; the issue persists with Python 3.8 and pip 21.2.4.
```
Collecting vllm
  Using cached vllm-0.1.0.tar.gz (83 kB)
  Running command pip subprocess to install build...
```
How does PagedAttention compare to Multi-Query Attention (https://arxiv.org/abs/1911.02150)? Are they compatible? For example, StarCoder uses MQA to speed up inference.
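For context, a minimal sketch (not vLLM internals) of the shape difference MQA introduces: a single K/V head is shared across all query heads, which shrinks the KV cache that PagedAttention manages. All tensor sizes here are illustrative.
```python
import torch

batch, seq, n_heads, head_dim = 1, 16, 8, 64

q = torch.randn(batch, n_heads, seq, head_dim)

# MHA: one K/V head per query head.
k_mha = torch.randn(batch, n_heads, seq, head_dim)
# MQA: a single K/V head shared by all query heads -> much smaller KV cache.
k_mqa = torch.randn(batch, 1, seq, head_dim)

scores_mha = q @ k_mha.transpose(-1, -2)   # (1, 8, 16, 16)
scores_mqa = q @ k_mqa.transpose(-1, -2)   # broadcasts over the head dim
print(scores_mha.shape, scores_mqa.shape)  # same attention shape, 8x less K/V
```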
BLOOM is an open-source LLM developed by BigScience. The BLOOM models rank highly in Hugging Face downloads. It'd be great to have these models in our catalog.
I successfully installed vLLM in WSL2, but when I tried to run the sample code, I got an error like this:
```
from vllm import LLM, SamplingParams

prompts = [...
```
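For reference, a runnable version of the quickstart being attempted, following the vLLM example code; the small facebook/opt-125m model here is just a convenient smoke test.
```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Any supported Hugging Face model works; opt-125m keeps the test cheap.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```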
Maybe not too urgent, but it would be nice to have `echo` in the OpenAI interface; this would facilitate scoring (e.g., on QA datasets).
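As an illustration, a hedged sketch of how `echo` enables prompt scoring in the OpenAI-style completions API that vLLM's server mirrors; the endpoint URL and model name are placeholders, and vLLM would still need to implement the flag.
```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # placeholder vLLM server URL
    json={
        "model": "facebook/opt-125m",  # placeholder model name
        "prompt": "Question: What is the capital of France? Answer: Paris",
        "max_tokens": 0,   # generate nothing; we only want prompt scores
        "echo": True,      # return the prompt tokens back in the response
        "logprobs": 1,     # with per-token log-probabilities
    },
)
# Per-token logprobs over the prompt can be summed to score a QA candidate.
print(resp.json()["choices"][0]["logprobs"]["token_logprobs"])
```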
I get the following error when I try to install vLLM on WSL Ubuntu. The main takeaway is: `Building wheel for vllm (pyproject.toml) did not run successfully.` Full...
I see these are all wrappers around vLLM. How do I integrate my existing Hugging Face model and see an out-of-the-box boost?
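A minimal sketch, assuming the existing model is a standard Hugging Face checkpoint of an architecture vLLM supports: vLLM loads HF weights directly from a model ID or a local directory, so no extra wrapper is needed. The path and prompt below are placeholders.
```python
from vllm import LLM, SamplingParams

# Pass either a Hugging Face model ID or a local HF checkpoint directory.
llm = LLM(model="/path/to/your/hf-checkpoint")  # placeholder path
params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```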