vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

2816 vllm issues

This PR aims to integrate CodeGen. Work in progress, not ready.

This PR accelerates LLaMA model weight loading with safetensors. I find that the current weight-loading implementation doubles the time cost as tensor-model parallelism increases (refer to the following...
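
For context, a minimal sketch of what safetensors-based loading looks like; the checkpoint path, weight key, and per-rank sharding below are illustrative assumptions, not vLLM's actual loader code.

```python
# Sketch: load a checkpoint with safetensors instead of torch.load.
# Path, key name, and the tensor-parallel sharding are hypothetical examples.
import torch
from safetensors.torch import load_file

state_dict = load_file("llama-7b/model.safetensors")  # fast, memory-mapped load

# Shard one column-parallel weight across tensor-parallel ranks.
tp_rank, tp_size = 0, 2
w = state_dict["model.layers.0.self_attn.q_proj.weight"]
shard = torch.chunk(w, tp_size, dim=0)[tp_rank]
```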

ExLlama (https://github.com/turboderp/exllama) is currently the fastest and most memory-efficient executor of models that I'm aware of. Is there interest from the maintainers in adding support for it?

Got this error with pip (`pip install vllm`):

```
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
```
...

Installation

Hi, will vLLM support 8-bit quantization, like https://github.com/TimDettmers/bitsandbytes? In HF, we can run a 13B LLM on a 24 GB GPU with `load_in_8bit=True`. Although PagedAttention can save 25% of GPU memory,...
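
For reference, this is roughly the HF path the issue describes; the model name is an example, and `bitsandbytes` and `accelerate` must be installed:

```python
# The HF 8-bit path referenced above: bitsandbytes quantizes the linear
# layers to int8 at load time. Model name is an illustrative example.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "huggyllama/llama-13b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    load_in_8bit=True,   # int8 weights via bitsandbytes
    device_map="auto",   # spread layers across available GPUs
)
```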

For getting structured outputs from custom-finetuned LLMs, extensive use of [constrained decoding](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.DisjunctiveConstraint) is standard. Is there a plan to add support for DisjunctiveConstraint (and others) to vLLM in the near...

good first issue
feature request
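
For context, a minimal sketch of the transformers constrained-decoding API that the issue above refers to; the model and phrases are illustrative:

```python
# DisjunctiveConstraint forces the output to contain one of several phrases;
# constrained decoding in transformers requires beam search (num_beams > 1).
from transformers import AutoModelForCausalLM, AutoTokenizer, DisjunctiveConstraint

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The output must contain either " blue" or " green".
phrases = [tokenizer(" blue", add_special_tokens=False).input_ids,
           tokenizer(" green", add_special_tokens=False).input_ids]

inputs = tokenizer("The sky is", return_tensors="pt")
out = model.generate(**inputs,
                     constraints=[DisjunctiveConstraint(phrases)],
                     num_beams=4, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```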

If I exceed the token limit of 4096, vLLM abruptly stops. It would be helpful if you could incorporate some logging functionality into the stopping code. This way, users...
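
A user-side sketch of the behavior being requested, using vLLM's offline `LLM` API; the model path is a placeholder, and the warning here is what the issue asks vLLM to emit itself:

```python
# Warn when a generation was cut off by max_tokens rather than a stop token.
# Model path is a placeholder; the logging is the behavior the issue requests.
import logging
from vllm import LLM, SamplingParams

logging.basicConfig(level=logging.WARNING)
llm = LLM(model="huggyllama/llama-7b")
params = SamplingParams(max_tokens=4096)

for req in llm.generate(["A very long prompt ..."], params):
    for out in req.outputs:
        if out.finish_reason == "length":  # hit the limit instead of stopping naturally
            logging.warning("request %s was truncated at the token limit", req.request_id)
```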

I have trained a Falcon 7B model with QLoRA, but the inference time is too high. So I want to use vLLM to speed up inference. For that I...
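
One common route for this workflow, sketched under the assumption that the QLoRA adapter was trained with peft; all paths are placeholders:

```python
# Merge the QLoRA adapter into the base Falcon weights with peft, then serve
# the merged model with vLLM (which does not load peft adapters directly here).
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
merged = PeftModel.from_pretrained(base, "path/to/qlora-adapter").merge_and_unload()
merged.save_pretrained("falcon-7b-merged")

from vllm import LLM
llm = LLM(model="falcon-7b-merged", trust_remote_code=True)
print(llm.generate(["Hello"])[0].outputs[0].text)
```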

Reference to issue https://github.com/vllm-project/vllm/issues/198.

I used benchmark_serving.py as the client, vLLM's api_server, and TGI's text_generation_server. The client command is listed below:

```
python benchmark_serving.py --backend tgi/vllm --tokenizer /data/llama --dataset /data/ShareGPT_V3_unfiltered_cleaned_split.json --host 10.3.1.2 --port 8108 ...
```