Woosuk Kwon

Results 284 comments of Woosuk Kwon

Currently, vLLM does not support ONNX models. Supporting Falcon is on our roadmap. Thanks for your suggestion.

@MotzWanted I'm working on it now. I think we can add a less-optimized version of Falcon (MQA replaced by MHA) quickly (within a few days) and then develop kernels to make...
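For illustration, a minimal sketch of the MQA-to-MHA workaround mentioned above (this is not vLLM's actual Falcon code; the function name and tensor shapes are assumptions): an MQA model can be run with a standard MHA attention path by replicating the single shared key/value head across all query heads.

```python
import torch

# Hedged sketch: emulate multi-query attention (MQA) with a standard
# multi-head attention (MHA) kernel by broadcasting the one shared K/V head.
# Shapes are assumptions: k, v have a single K/V head at dim 1.
def expand_kv_for_mha(k: torch.Tensor, v: torch.Tensor, num_heads: int):
    # k, v: [batch, 1, seq_len, head_dim] -- one shared K/V head in MQA.
    k = k.expand(-1, num_heads, -1, -1)  # -> [batch, num_heads, seq_len, head_dim]
    v = v.expand(-1, num_heads, -1, -1)
    return k, v
```

The trade-off is memory and bandwidth: the KV cache effectively grows by the number of heads, which is why dedicated MQA kernels are the follow-up step.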

I think no change is needed on the vLLM side. You can simply combine the additional weights in LoRA with the pretrained model weights. Then the resulting model has the...
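A minimal sketch of that merge, assuming a PEFT-style LoRA adapter (the paths below are placeholders, not anything from the thread):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the pretrained base model and apply the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("path/to/pretrained-model")
lora_model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = lora_model.merge_and_unload()

# Save the merged checkpoint; vLLM can then load this directory directly.
merged.save_pretrained("path/to/merged-model")
```

After merging, the model is served like any other Hugging Face checkpoint, with no LoRA-specific support needed at inference time.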

@asalaria-cisco Thanks for the further explanation! You're right. I also agree that that's a cool feature to have. Currently, we are focusing on fixing bugs and adding new models. After...

Thanks for your interest! vLLM is an inference and serving engine/backend like FasterTransformer, but is highly optimized for serving throughput. We provide FastAPI and OpenAI API-compatible servers for convenience, but...
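As an illustration of the OpenAI-compatible server mentioned above, a client query might look like the sketch below (the local base URL, API key, and model name are placeholders, and an openai>=1.0-style client is assumed):

```python
from openai import OpenAI

# Placeholder base URL for a locally running vLLM OpenAI-compatible server;
# the api_key value is not checked by a local server but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",   # placeholder model name
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```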

@ElizabethCappon @sharlec @dcruiz01 @bashirsouid Thanks for reporting the bug. It seems the way vLLM parses the NVCC version does not work under some environments. While I'm investigating the issue, could...
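For context, NVCC version detection typically means parsing the output of `nvcc --version`, which varies across environments. A hedged sketch of that kind of check (this is not vLLM's actual implementation):

```python
import re
import subprocess

# Illustrative only: read the CUDA toolkit version from `nvcc --version`.
def get_nvcc_version():
    output = subprocess.check_output(["nvcc", "--version"], text=True)
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        raise RuntimeError(f"Could not parse nvcc version from:\n{output}")
    return int(match.group(1)), int(match.group(2))
```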

@hxssgaa It's an ABI error. Could you run `pip uninstall torch` and then reinstall vLLM?

Hi @flyman3046, thanks for trying out vLLM! Could you try this

```python
llm.generate(my_dataset, sampling_params, ignore_eos=True)
```

instead of the for loop? In fact, the `LLM` class internally maintains a queue...
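For context on the batched call above, a minimal end-to-end sketch (the model name and the contents of `my_dataset` are placeholders):

```python
from vllm import LLM, SamplingParams

# Placeholder prompts; in practice this is the full dataset.
my_dataset = ["Hello, my name is", "The capital of France is"]

# ignore_eos keeps every sequence generating up to max_tokens,
# which is useful for benchmarking throughput.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95,
                                 ignore_eos=True, max_tokens=128)

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Pass the whole dataset at once; vLLM schedules and batches the
# requests internally instead of processing them one by one.
outputs = llm.generate(my_dataset, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Calling `generate` once on the full list lets the engine keep its batches full, which is where most of the throughput gain over a per-prompt loop comes from.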

@flyman3046 Thanks for sharing your experience! We use a more sophisticated batching mechanism than the traditional one. In short, vLLM does not wait until all the sequences in a...
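To make the idea concrete, here is an illustrative-only sketch of iteration-level (continuous) batching; the class and method names below are hypothetical and not vLLM's actual scheduler.

```python
from collections import deque

class Request:
    def __init__(self, prompt: str, max_tokens: int):
        self.prompt = prompt
        self.max_tokens = max_tokens
        self.generated = 0
        self.done = False

class Scheduler:
    def __init__(self, max_batch_size: int = 32):  # arbitrary cap for the sketch
        self.waiting = deque()   # requests not yet admitted
        self.running = []        # requests currently in the batch
        self.max_batch_size = max_batch_size

    def add(self, request: Request) -> None:
        self.waiting.append(request)

    def step(self) -> None:
        # Admit new requests at every iteration instead of waiting
        # for the whole batch to finish.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # One decoding iteration for every running request
        # (the actual model forward pass is omitted here).
        for req in self.running:
            req.generated += 1
            if req.generated >= req.max_tokens:
                req.done = True

        # Retire finished requests immediately so their slots are reused
        # by waiting requests on the next iteration.
        self.running = [r for r in self.running if not r.done]
```

The key difference from static batching is in `step`: finished sequences leave the batch and new ones join it every iteration, so short and long requests no longer block each other.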

@junior-zsy We are prioritizing some small PRs and bug fixes since we are planning to release v0.2.3 in a few days. I think I will be able to review the...