vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Thanks for the great project. I gave it a try and compared it with HF's offline inference speed on 100 Alpaca examples. The hardware I used is a single V100-40G GPU. Here...
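For context, a minimal sketch of this kind of comparison, assuming the public offline APIs of vLLM (`LLM`, `SamplingParams`) and HF `transformers`; the model name and prompts are placeholders, not the reporter's exact setup or dataset.

```python
# Rough offline-throughput comparison: vLLM batch generation vs. HF generate.
import time
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = ["Give three tips for staying healthy."] * 8  # stand-in for Alpaca examples

# vLLM offline batch inference
llm = LLM(model="facebook/opt-1.3b")
params = SamplingParams(temperature=0.8, max_tokens=128)
t0 = time.time()
llm.generate(prompts, params)
print(f"vLLM: {time.time() - t0:.2f}s for {len(prompts)} prompts")

# HF transformers baseline, one prompt at a time
tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").cuda()
t0 = time.time()
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids.cuda()
    model.generate(ids, max_new_tokens=128, do_sample=True, temperature=0.8)
print(f"HF: {time.time() - t0:.2f}s for {len(prompts)} prompts")
```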
It would be great if you could support fastchat-t5-3b-v1.0, which is derived from the Flan-T5-XL model: https://huggingface.co/lmsys/fastchat-t5-3b-v1.0
I found that there is a kernel left for subsequent optimization in RMSNorm, and I tried to write a half-precision kernel for it. Below is the comparison data; I tested...
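For reference, a sketch of what an fp16 RMSNorm with fp32 accumulation computes; this is plain PyTorch, not the CUDA kernel discussed in the issue.

```python
import torch

def rms_norm_fp16(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: (..., hidden_size) in float16; accumulate the variance in float32 for accuracy
    x_fp32 = x.float()
    variance = x_fp32.pow(2).mean(dim=-1, keepdim=True)
    normed = x_fp32 * torch.rsqrt(variance + eps)
    return (normed * weight.float()).to(x.dtype)

x = torch.randn(4, 4096, dtype=torch.float16, device="cuda")
w = torch.ones(4096, dtype=torch.float16, device="cuda")
out = rms_norm_fp16(x, w)
```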
I have been trying to use quantized versions of models on my GPU, which has at most 6 GB of VRAM. However, nothing seems to work. How would I go about using...
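One way to fit a model into roughly 6 GB of VRAM today is 8-bit loading through HF `transformers` + `bitsandbytes`; this is outside vLLM and shown only as a point of comparison. The model name is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder; pick a model sized for your GPU
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # spill layers to CPU if VRAM runs out (needs accelerate)
    load_in_8bit=True,   # bitsandbytes int8 weights
)
ids = tok("The capital of France is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
```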
It would be great if you could support MPT-7B and MPT-30B
Can vLLM achieve performance comparable to FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goal you want to achieve. BTW, vLLM really accelerates...
Hello, thanks for the great framework for deploying LLMs. Would it be possible to use an LLM compiled with the CTranslate2 library?
As far as I know, `vllm` and `ray` don't support `8-bit quantization` as of now. I think it's the most viable quantization technique out there and should be implemented for...
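To make the request concrete, a toy illustration of what 8-bit (absmax) weight quantization does; real int8 inference (e.g. LLM.int8()) is considerably more involved.

```python
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0                       # per-tensor absmax scale
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 1024)
q, scale = quantize_int8(w)
print("max abs error:", (dequantize_int8(q, scale) - w).abs().max().item())
```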
Hi there, I had a question about working with the API server from the [instructions](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) here. I am running it after running the Docker command: # Pull the Docker image...
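For anyone in a similar spot, a minimal client call against the demo API server from the quickstart, assuming it is exposed on localhost:8000 (adjust host/port for your Docker port mapping); the `/generate` endpoint and fields follow the quickstart docs.

```python
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "San Francisco is a",
        "n": 1,
        "temperature": 0.0,
        "max_tokens": 32,
    },
)
print(resp.json())
```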
Using Ray here is considered to be overkill. You can easily create a multi-process distributed environment using torch.distributed or an MPI launcher. Internally you can leverage NCCL or MPI...
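A minimal sketch of the alternative being suggested: spin up one process per GPU with `torchrun` and initialize an NCCL process group via `torch.distributed`, instead of relying on Ray for placement.

```python
# Launch with e.g.:  torchrun --nproc_per_node=2 this_script.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # torchrun sets rank/world-size env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; the all-reduce stands in for the NCCL
    # collectives that tensor-parallel inference would issue.
    x = torch.ones(4, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```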