vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
This PR aims to integrate CodeGen. Work in progress, not ready.
This PR is for accelerating LLaMA model weight loading with safetensors. I find that the current weight-loading implementation doubles the time cost as the tensor-model parallelism increases (refer to the following...
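For context, a minimal sketch of the comparison being made, assuming local `model.safetensors` and `pytorch_model.bin` files (both file names are placeholders); `safetensors` memory-maps the file rather than unpickling it:

```python
# Minimal sketch: time a safetensors load against a pickle-based torch.load.
# "model.safetensors" and "pytorch_model.bin" are placeholder file names.
import time

import torch
from safetensors.torch import load_file

start = time.perf_counter()
state_dict = load_file("model.safetensors", device="cpu")  # memory-mapped read
print(f"safetensors: {time.perf_counter() - start:.2f}s, {len(state_dict)} tensors")

start = time.perf_counter()
state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # pickle-based read
print(f"torch.load:  {time.perf_counter() - start:.2f}s, {len(state_dict)} tensors")
```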
ExLlama (https://github.com/turboderp/exllama) is currently the fastest and most memory-efficient model executor that I'm aware of. Is there interest from the maintainers in adding support for it?
Got this error with pip (`pip install vllm`): ``` error: subprocess-exited-with-error × Getting requirements to build wheel did not run successfully. │ exit code: 1 ╰─> See above for output....
Hi, will vLLM support 8-bit quantization, like https://github.com/TimDettmers/bitsandbytes? In HF, we can run a 13B LLM on a 24 GB GPU with `load_in_8bit=True`. Although PagedAttention can save 25% of GPU memory,...
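For reference, the HF + bitsandbytes pattern being described — a minimal sketch, where `huggyllama/llama-13b` is just a stand-in model id and bitsandbytes must be installed:

```python
# Minimal sketch of the HF 8-bit loading path referenced above.
# "huggyllama/llama-13b" is a placeholder model id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # weights quantized to int8 via bitsandbytes
    device_map="auto",   # place layers automatically across available GPUs
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```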
For getting structured outputs from custom-finetuned LLMs, extensive use of [constrained decoding](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.DisjunctiveConstraint) is standard. Is there a plan to add support for DisjunctiveConstraint (and others) to vLLM in the near...
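For context, the HF usage pattern being referenced — constrained beam search with `DisjunctiveConstraint` (a minimal sketch; `gpt2` and the candidate phrases are just stand-ins):

```python
# Minimal sketch of constrained decoding with transformers' DisjunctiveConstraint.
# The constraint forces one of the listed phrases to appear in the output;
# constrained generation requires beam search (num_beams > 1).
from transformers import AutoModelForCausalLM, AutoTokenizer, DisjunctiveConstraint

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

phrases = ["positive", "negative"]
nested_token_ids = [tokenizer(p, add_special_tokens=False).input_ids for p in phrases]
constraint = DisjunctiveConstraint(nested_token_ids)

inputs = tokenizer("The review sentiment is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    constraints=[constraint],
    num_beams=4,
    max_new_tokens=10,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```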
If I exceed the token limit of 4096, vLLM abruptly stops. It would be helpful if you could incorporate some logging functionality into the stopping code. This way, users...
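Until such logging exists, a user-side guard is straightforward — a minimal sketch, where the model id and the 4096-token limit are placeholders for whatever your deployment actually uses:

```python
# Minimal sketch: warn before sending a prompt that exceeds the context window.
# MODEL_ID and MAX_MODEL_LEN are placeholders for your deployment.
import logging

from transformers import AutoTokenizer

MODEL_ID = "huggyllama/llama-7b"
MAX_MODEL_LEN = 4096

logging.basicConfig(level=logging.INFO)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def fits_context(prompt: str, max_new_tokens: int) -> bool:
    """Return True if prompt plus requested generation fits the context window."""
    n_prompt_tokens = len(tokenizer(prompt).input_ids)
    if n_prompt_tokens + max_new_tokens > MAX_MODEL_LEN:
        logging.warning(
            "Prompt (%d tokens) + max_new_tokens (%d) exceeds the %d-token limit; "
            "the request would be cut off.",
            n_prompt_tokens, max_new_tokens, MAX_MODEL_LEN,
        )
        return False
    return True
```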
I have trained a Falcon 7B model with QLoRA, but the inference time is too high. So I want to use vLLM to speed up inference; for that I...
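One possible route (a sketch under assumptions, not an officially documented path): merge the QLoRA adapter into the base weights with PEFT, save the merged model, then load it with vLLM. The adapter and output paths below are placeholders.

```python
# Sketch: merge a QLoRA adapter into the Falcon base model, then serve it with vLLM.
# "./falcon-7b-qlora-adapter" and "./falcon-7b-merged" are placeholder paths;
# vLLM loads full model weights, not LoRA adapters, hence the merge step.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", torch_dtype=torch.float16, trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "./falcon-7b-qlora-adapter").merge_and_unload()
merged.save_pretrained("./falcon-7b-merged")
AutoTokenizer.from_pretrained("tiiuae/falcon-7b").save_pretrained("./falcon-7b-merged")

# Batched inference on the merged weights with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="./falcon-7b-merged", trust_remote_code=True)
outputs = llm.generate(["Explain QLoRA in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```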
reference to issue https://github.com/vllm-project/vllm/issues/198
I use benchmark_serving.py as the client, api_server for vLLM, and text_generation_server for TGI. The client command is listed below: `python benchmark_serving.py --backend tgi/vllm --tokenizer /data/llama --dataset /data/ShareGPT_V3_unfiltered_cleaned_split.json --host 10.3.1.2 --port 8108...`