Woosuk Kwon

Results 284 comments of Woosuk Kwon

Currently, vLLM does not support ONNX models. Supporting Falcon is on our roadmap. Thanks for your suggestion.

@MotzWanted I'm working on it now. I think we can add a less-optimized version of Falcon (MQA replaced by MHA) quickly (within a few days) and then develop kernels to make...
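For illustration, a minimal sketch of the MQA-to-MHA workaround mentioned above (this is not vLLM's actual Falcon code; the function name and tensor shapes are assumptions): an MQA model can be run with a standard MHA attention path by replicating the single shared key/value head across all query heads.

```python
import torch

# Hedged sketch: emulate multi-query attention (MQA) with a standard
# multi-head attention (MHA) kernel by broadcasting the one shared K/V head.
# Shapes are assumptions: k, v have a single K/V head at dim 1.
def expand_kv_for_mha(k: torch.Tensor, v: torch.Tensor, num_heads: int):
    # k, v: [batch, 1, seq_len, head_dim] -- one shared K/V head in MQA.
    k = k.expand(-1, num_heads, -1, -1)  # -> [batch, num_heads, seq_len, head_dim]
    v = v.expand(-1, num_heads, -1, -1)
    return k, v
```

The trade-off is memory and bandwidth: the KV cache effectively grows by the number of heads, which is why dedicated MQA kernels are the follow-up step.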

I think no change is needed on the vLLM side. You can simply combine the additional weights in LoRA with the pretrained model weights. Then the resulting model has the...
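A minimal sketch of that merge, assuming a PEFT-style LoRA adapter (the paths below are placeholders, not anything from the thread):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the pretrained base model and apply the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("path/to/pretrained-model")
lora_model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = lora_model.merge_and_unload()

# Save the merged checkpoint; vLLM can then load this directory directly.
merged.save_pretrained("path/to/merged-model")
```

After merging, the model is served like any other Hugging Face checkpoint, with no LoRA-specific support needed at inference time.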

@asalaria-cisco Thanks for the further explanation! You're right. I also agree that that's a cool feature to have. Currently, we are focusing on fixing bugs and adding new models. After...

Thanks for your interest! vLLM is an inference and serving engine/backend like FasterTransformer, but is highly optimized for serving throughput. We provide FastAPI and OpenAI API-compatible servers for convenience, but...
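As an illustration of the OpenAI-compatible server mentioned above, a client query might look like the sketch below (the local base URL, API key, and model name are placeholders, and an openai>=1.0-style client is assumed):

```python
from openai import OpenAI

# Placeholder base URL for a locally running vLLM OpenAI-compatible server;
# the api_key value is not checked by a local server but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",   # placeholder model name
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```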

@ElizabethCappon @sharlec @dcruiz01 @bashirsouid Thanks for reporting the bug. It seems the way vLLM parses the NVCC version does not work under some environments. While I'm investigating the issue, could...
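For context, NVCC version detection typically means parsing the output of `nvcc --version`, which varies across environments. A hedged sketch of that kind of check (this is not vLLM's actual implementation):

```python
import re
import subprocess

# Illustrative only: read the CUDA toolkit version from `nvcc --version`.
def get_nvcc_version():
    output = subprocess.check_output(["nvcc", "--version"], text=True)
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        raise RuntimeError(f"Could not parse nvcc version from:\n{output}")
    return int(match.group(1)), int(match.group(2))
```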

@hxssgaa It's an ABI error. Could you run `pip uninstall torch` and then reinstall vLLM?

Hi @flyman3046, thanks for trying out vLLM! Could you try this

```python
llm.generate(my_dataset, sampling_params, ignore_eos=True)
```

instead of the for loop? In fact, the `LLM` class internally maintains a queue...
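For context on the batched call above, a minimal end-to-end sketch (the model name and the contents of `my_dataset` are placeholders):

```python
from vllm import LLM, SamplingParams

# Placeholder prompts; in practice this is the full dataset.
my_dataset = ["Hello, my name is", "The capital of France is"]

# ignore_eos keeps every sequence generating up to max_tokens,
# which is useful for benchmarking throughput.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95,
                                 ignore_eos=True, max_tokens=128)

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Pass the whole dataset at once; vLLM schedules and batches the
# requests internally instead of processing them one by one.
outputs = llm.generate(my_dataset, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Calling `generate` once on the full list lets the engine keep its batches full, which is where most of the throughput gain over a per-prompt loop comes from.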

@flyman3046 Thanks for sharing your experience! We use a more sophisticated batching mechanism than the traditional one. In short, vLLM does not wait until all the sequences in a...
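To make the idea concrete, here is an illustrative-only sketch of iteration-level (continuous) batching; the class and method names below are hypothetical and not vLLM's actual scheduler.

```python
from collections import deque

class Request:
    def __init__(self, prompt: str, max_tokens: int):
        self.prompt = prompt
        self.max_tokens = max_tokens
        self.generated = 0
        self.done = False

class Scheduler:
    def __init__(self, max_batch_size: int = 32):  # arbitrary cap for the sketch
        self.waiting = deque()   # requests not yet admitted
        self.running = []        # requests currently in the batch
        self.max_batch_size = max_batch_size

    def add(self, request: Request) -> None:
        self.waiting.append(request)

    def step(self) -> None:
        # Admit new requests at every iteration instead of waiting
        # for the whole batch to finish.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # One decoding iteration for every running request
        # (the actual model forward pass is omitted here).
        for req in self.running:
            req.generated += 1
            if req.generated >= req.max_tokens:
                req.done = True

        # Retire finished requests immediately so their slots are reused
        # by waiting requests on the next iteration.
        self.running = [r for r in self.running if not r.done]
```

The key difference from static batching is in `step`: finished sequences leave the batch and new ones join it every iteration, so short and long requests no longer block each other.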

@junior-zsy We are prioritizing some small PRs and bug fixes since we are planning to release v0.2.3 in a few days. I think I will be able to review the...