vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
I tried to use vllm on my fine-tuned StarCoder model, but it seems not to be supported by the official package (?), even though the README.md says it is supported. ```...
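A minimal sketch of loading a fine-tuned checkpoint, assuming it is saved in Hugging Face format; the local path below is a hypothetical example, not taken from the report:

```python
# Minimal sketch: loading a fine-tuned StarCoder-style checkpoint with vLLM.
# "/path/to/finetuned-starcoder" is a hypothetical directory containing a
# Hugging Face-format checkpoint (config.json, tokenizer files, weights).
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/finetuned-starcoder", trust_remote_code=True)
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["def fibonacci(n):"], params)
for out in outputs:
    print(out.outputs[0].text)
```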
The langchain implementation sends the prompt as an array of strings to the /v1/completions endpoint. With this change, it is possible to use a simple string or an array of...
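A sketch of the prompt handling the change describes, accepting either a single string or a list of strings; this is an illustrative pydantic-style snippet, not the actual vLLM server code:

```python
# Illustrative request model and normalization for a /v1/completions-style
# endpoint that accepts prompt as either a string or an array of strings.
from typing import List, Union

from pydantic import BaseModel


class CompletionRequest(BaseModel):
    model: str
    prompt: Union[str, List[str]]
    max_tokens: int = 16


def normalize_prompts(prompt: Union[str, List[str]]) -> List[str]:
    """Return a list of prompts regardless of the input shape."""
    if isinstance(prompt, str):
        return [prompt]
    return prompt
```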
- Allow user to specify multiple models to download when loading server
- Allow user to switch between models
- Allow user to load multiple models on the cluster (nice...
I'm trying to run this project with the following Dockerfile:
```Dockerfile
FROM nvcr.io/nvidia/pytorch:22.12-py3
RUN pip uninstall torch -y
WORKDIR /workspace
COPY /inference/vllm /workspace/inference/vllm
WORKDIR /workspace/inference/vllm
RUN pip install -e .
```
...
Look at this: the output here continues for half an hour and never stops, but nothing is generated. The new request is pending.
Are there any prompt size limits? It seems that using more than 120 words makes the model unresponsive. Check the following case. In the first try I used 112 words...
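The limit is measured in tokens rather than words, so a rough sanity check is to count the prompt's tokens against the model's context window before sending the request. A sketch, with the model id chosen only as an illustration:

```python
# Rough sketch: count prompt tokens and compare against the model's context
# window. "facebook/opt-125m" is only an illustrative model id, not the one
# from the report above.
from transformers import AutoConfig, AutoTokenizer

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
context_len = AutoConfig.from_pretrained(model_id).max_position_embeddings

prompt = "..."  # placeholder for the long prompt from the report
n_prompt_tokens = len(tokenizer(prompt).input_ids)

max_new_tokens = 256  # room reserved for the generated continuation
print("fits in context:", n_prompt_tokens + max_new_tokens <= context_len)
```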
Hi guys, we found that inference with vllm can greatly improve performance! But we need to use LoRA (`peft`) at inference time. We also found that the community has a strong demand...
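One common workaround while LoRA is not supported natively is to merge the adapter into the base weights with `peft` and serve the merged checkpoint. A sketch under that assumption; the model id and paths are hypothetical:

```python
# Sketch of the merge-then-serve workaround: bake a LoRA adapter into the
# base model with peft, save the merged weights, then point vLLM at them.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "huggyllama/llama-7b"          # hypothetical base model
adapter_path = "/path/to/lora-adapter"   # hypothetical adapter directory
merged_path = "/path/to/merged-model"

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_path)
model = model.merge_and_unload()         # fold LoRA deltas into the base weights

model.save_pretrained(merged_path)
AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_path)

# The merged directory can then be served with vLLM:
#   from vllm import LLM
#   llm = LLM(model=merged_path)
```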
In my case, I can deploy the vllm service on a single GPU, but when I use multiple GPUs I hit a Ray OOM error. Could you please help solve...
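Not a verified fix for this report, but a sketch of the multi-GPU knobs that usually matter here: tensor parallelism shards the model across Ray workers, and lowering the memory utilization cap leaves headroom on each GPU. Model id and values are illustrative:

```python
# Illustrative multi-GPU setup: shard across 2 GPUs and leave memory headroom
# for Ray and other processes on the node.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-13b",      # example model
    tensor_parallel_size=2,        # number of GPUs to shard across
    gpu_memory_utilization=0.85,   # fraction of each GPU vLLM may pre-allocate
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```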
Only works for Falcon-7B for now. The Falcon-40B model generates garbage outputs. Needs debugging.
I want to load a local model whose files are the same as those downloaded from Hugging Face. However, right now this repository seems to only support downloading from the Hub.
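Since the `model` argument is resolved through Hugging Face's loaders, a local directory in Hub format generally works as well. A minimal sketch; the path is a hypothetical example:

```python
# Sketch: pass a local directory (config.json, tokenizer files, weight shards)
# instead of a Hub model id. "/models/my-local-llm" is a hypothetical path.
from vllm import LLM

llm = LLM(model="/models/my-local-llm")
print(llm.generate(["Hello"])[0].outputs[0].text)
```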