vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Thanks for the great project. I gave it a try and compared it with HF's offline inference speed on 100 Alpaca examples. The hardware I used is a single V100-40G GPU. Here...
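For context, a minimal sketch of this kind of comparison, assuming the public offline APIs of vLLM (`LLM`, `SamplingParams`) and HF `transformers`; the model name and prompts are placeholders, not the reporter's exact setup or dataset.

```python
# Rough offline-throughput comparison: vLLM batch generation vs. HF generate.
import time
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = ["Give three tips for staying healthy."] * 8  # stand-in for Alpaca examples

# vLLM offline batch inference
llm = LLM(model="facebook/opt-1.3b")
params = SamplingParams(temperature=0.8, max_tokens=128)
t0 = time.time()
llm.generate(prompts, params)
print(f"vLLM: {time.time() - t0:.2f}s for {len(prompts)} prompts")

# HF transformers baseline, one prompt at a time
tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").cuda()
t0 = time.time()
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids.cuda()
    model.generate(ids, max_new_tokens=128, do_sample=True, temperature=0.8)
print(f"HF: {time.time() - t0:.2f}s for {len(prompts)} prompts")
```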
It would be great if you could support fastchat-t5-3b-v1.0, which is derived from the Flan-T5-XL model: https://huggingface.co/lmsys/fastchat-t5-3b-v1.0
I found that there is a kernel left for subsequent optimization in RMSNorm, and I tried to write a half-precision kernel for it. Below is the comparison data; I tested...
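For reference, a sketch of what an fp16 RMSNorm with fp32 accumulation computes; this is plain PyTorch, not the CUDA kernel discussed in the issue.

```python
import torch

def rms_norm_fp16(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: (..., hidden_size) in float16; accumulate the variance in float32 for accuracy
    x_fp32 = x.float()
    variance = x_fp32.pow(2).mean(dim=-1, keepdim=True)
    normed = x_fp32 * torch.rsqrt(variance + eps)
    return (normed * weight.float()).to(x.dtype)

x = torch.randn(4, 4096, dtype=torch.float16, device="cuda")
w = torch.ones(4096, dtype=torch.float16, device="cuda")
out = rms_norm_fp16(x, w)
```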
I have been trying to use quantized versions of models on my GPU, which has at most 6 GB of VRAM. However, nothing seems to work. How would I go about using...
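One way to fit a model into roughly 6 GB of VRAM today is 8-bit loading through HF `transformers` + `bitsandbytes`; this is outside vLLM and shown only as a point of comparison. The model name is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder; pick a model sized for your GPU
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # spill layers to CPU if VRAM runs out (needs accelerate)
    load_in_8bit=True,   # bitsandbytes int8 weights
)
ids = tok("The capital of France is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
```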
It would be great if you could support MPT-7B and MPT-30B
Can vLLM achieve performance comparable to FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goal you want to achieve. BTW, vLLM really accelerates...
Hello, thanks for the great framework for deploying LLMs. Would it be possible to use an LLM compiled with the CTranslate2 library?
As far as I know, `vllm` and `ray` don't support `8-bit quantization` as of now. I think it's the most viable quantization technique out there and should be implemented for...
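To make the request concrete, a toy illustration of what 8-bit (absmax) weight quantization does; real int8 inference (e.g. LLM.int8()) is considerably more involved.

```python
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0                       # per-tensor absmax scale
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 1024)
q, scale = quantize_int8(w)
print("max abs error:", (dequantize_int8(q, scale) - w).abs().max().item())
```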
Hi there, I had a question about working with the API server from the [instructions](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) here. I am running it after running the Docker command: # Pull the Docker image...
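For anyone in a similar spot, a minimal client call against the demo API server from the quickstart, assuming it is exposed on localhost:8000 (adjust host/port for your Docker port mapping); the `/generate` endpoint and fields follow the quickstart docs.

```python
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "San Francisco is a",
        "n": 1,
        "temperature": 0.0,
        "max_tokens": 32,
    },
)
print(resp.json())
```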
Using Ray here is considered to be overkill. You can easily create a multi-process distributed environment using torch.distributed or an MPI launcher. Internally you can leverage NCCL or MPI...
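A minimal sketch of the alternative being suggested: spin up one process per GPU with `torchrun` and initialize an NCCL process group via `torch.distributed`, instead of relying on Ray for placement.

```python
# Launch with e.g.:  torchrun --nproc_per_node=2 this_script.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # torchrun sets rank/world-size env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; the all-reduce stands in for the NCCL
    # collectives that tensor-parallel inference would issue.
    x = torch.ones(4, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```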