vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
This PR aims to integrate CodeGen. Work in progress, not ready.
This PR is for accelerating LLaMA model weight loading with safetensors. I find that the current weight-loading implementation doubles the time cost as the tensor-model parallelism increases (refer to the following...
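For context, a minimal sketch of the comparison being made, assuming local `model.safetensors` and `pytorch_model.bin` files (both file names are placeholders); `safetensors` memory-maps the file rather than unpickling it:

```python
# Minimal sketch: time a safetensors load against a pickle-based torch.load.
# "model.safetensors" and "pytorch_model.bin" are placeholder file names.
import time

import torch
from safetensors.torch import load_file

start = time.perf_counter()
state_dict = load_file("model.safetensors", device="cpu")  # memory-mapped read
print(f"safetensors: {time.perf_counter() - start:.2f}s, {len(state_dict)} tensors")

start = time.perf_counter()
state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # pickle-based read
print(f"torch.load:  {time.perf_counter() - start:.2f}s, {len(state_dict)} tensors")
```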
ExLlama (https://github.com/turboderp/exllama) is currently the fastest and most memory-efficient model executor that I'm aware of. Is there interest from the maintainers in adding support for it?
Got this error with pip (`pip install vllm`): ``` error: subprocess-exited-with-error × Getting requirements to build wheel did not run successfully. │ exit code: 1 ╰─> See above for output....
Hi, will vLLM support 8-bit quantization, like https://github.com/TimDettmers/bitsandbytes? In HF, we can run a 13B LLM on a 24 GB GPU with `load_in_8bit=True`. Although PagedAttention can save 25% of GPU memory,...
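For reference, the HF + bitsandbytes pattern being described — a minimal sketch, where `huggyllama/llama-13b` is just a stand-in model id and bitsandbytes must be installed:

```python
# Minimal sketch of the HF 8-bit loading path referenced above.
# "huggyllama/llama-13b" is a placeholder model id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # weights quantized to int8 via bitsandbytes
    device_map="auto",   # place layers automatically across available GPUs
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```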
For getting structured outputs from custom-finetuned LLMs, extensive use of [constrained decoding](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.DisjunctiveConstraint) is standard. Is there a plan to add support for DisjunctiveConstraint (and others) to vLLM in the near...
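For context, the HF usage pattern being referenced — constrained beam search with `DisjunctiveConstraint` (a minimal sketch; `gpt2` and the candidate phrases are just stand-ins):

```python
# Minimal sketch of constrained decoding with transformers' DisjunctiveConstraint.
# The constraint forces one of the listed phrases to appear in the output;
# constrained generation requires beam search (num_beams > 1).
from transformers import AutoModelForCausalLM, AutoTokenizer, DisjunctiveConstraint

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

phrases = ["positive", "negative"]
nested_token_ids = [tokenizer(p, add_special_tokens=False).input_ids for p in phrases]
constraint = DisjunctiveConstraint(nested_token_ids)

inputs = tokenizer("The review sentiment is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    constraints=[constraint],
    num_beams=4,
    max_new_tokens=10,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```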
If I exceed the token limit of 4096, vLLM abruptly stops. It would be helpful if you could incorporate some logging functionality into the stopping code. This way, users...
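Until such logging exists, a user-side guard is straightforward — a minimal sketch, where the model id and the 4096-token limit are placeholders for whatever your deployment actually uses:

```python
# Minimal sketch: warn before sending a prompt that exceeds the context window.
# MODEL_ID and MAX_MODEL_LEN are placeholders for your deployment.
import logging

from transformers import AutoTokenizer

MODEL_ID = "huggyllama/llama-7b"
MAX_MODEL_LEN = 4096

logging.basicConfig(level=logging.INFO)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def fits_context(prompt: str, max_new_tokens: int) -> bool:
    """Return True if prompt plus requested generation fits the context window."""
    n_prompt_tokens = len(tokenizer(prompt).input_ids)
    if n_prompt_tokens + max_new_tokens > MAX_MODEL_LEN:
        logging.warning(
            "Prompt (%d tokens) + max_new_tokens (%d) exceeds the %d-token limit; "
            "the request would be cut off.",
            n_prompt_tokens, max_new_tokens, MAX_MODEL_LEN,
        )
        return False
    return True
```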
I have trained a Falcon 7B model with QLoRA, but the inference time is too high. So I want to use vLLM to speed up inference; for that I...
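One possible route (a sketch under assumptions, not an officially documented path): merge the QLoRA adapter into the base weights with PEFT, save the merged model, then load it with vLLM. The adapter and output paths below are placeholders.

```python
# Sketch: merge a QLoRA adapter into the Falcon base model, then serve it with vLLM.
# "./falcon-7b-qlora-adapter" and "./falcon-7b-merged" are placeholder paths;
# vLLM loads full model weights, not LoRA adapters, hence the merge step.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", torch_dtype=torch.float16, trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "./falcon-7b-qlora-adapter").merge_and_unload()
merged.save_pretrained("./falcon-7b-merged")
AutoTokenizer.from_pretrained("tiiuae/falcon-7b").save_pretrained("./falcon-7b-merged")

# Batched inference on the merged weights with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="./falcon-7b-merged", trust_remote_code=True)
outputs = llm.generate(["Explain QLoRA in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```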
reference to issue https://github.com/vllm-project/vllm/issues/198
I use benchmark_serving.py as the client, api_server for vLLM, and text_generation_server for TGI. The client command is listed below: `python benchmark_serving.py --backend tgi/vllm --tokenizer /data/llama --dataset /data/ShareGPT_V3_unfiltered_cleaned_split.json --host 10.3.1.2 --port 8108...`