Antoni Baum
Hugging Face Optimum's BetterTransformer replaces model layers with its own fused versions. Those layers have multiple attributes set to None, which causes the following exception when we use AutoTP with `deepspeed.init_inference` on...
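A minimal plain-Python sketch of the failure mode and the tolerant fix. The class and function names below are illustrative stand-ins, not the actual Optimum or DeepSpeed code: the point is that sharding logic which walks layer attributes must skip the ones BetterTransformer set to None.

```python
class FusedLayer:
    """Stand-in for a BetterTransformer layer: the fused weight replaces
    the original projections, which are left as None."""
    def __init__(self):
        self.in_proj_weight = [1.0, 2.0, 3.0, 4.0]  # fused QKV weight
        self.q_proj = None  # removed in favor of the fused weight
        self.k_proj = None

def shard_attrs(layer, world_size):
    """AutoTP-style sketch: split each list attribute across ranks.
    Without the None check below, len(None) raises a TypeError."""
    shards = {}
    for name, value in vars(layer).items():
        if value is None:  # the fix: tolerate attributes set to None
            continue
        step = len(value) // world_size
        shards[name] = [value[i * step:(i + 1) * step]
                        for i in range(world_size)]
    return shards
```

With `world_size=2`, only the non-None fused weight gets sharded and the None projections are skipped instead of crashing.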
Follows best practices and ensures easier subclassing.
# What does this PR do? Improves consistency and ease of use: you can just run `make` to install vllm without any extra steps. Fixes # (issue) ## Before submitting...
This PR adds support for running multiple LoRA adapters in a single batch in a similar fashion to the S-LoRA/punica projects. WIP: - I want to clean up the code...
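A hedged pure-Python sketch of the core idea behind S-LoRA/punica-style multi-adapter batching: every request in a batch carries its own adapter id, the base projection is shared, and a per-request low-rank delta is added on top. Scalars stand in for token vectors and weight matrices; none of these names are vLLM's actual API.

```python
def apply_lora_batch(xs, adapter_ids, base_w, loras):
    """xs: per-request inputs (scalars standing in for token vectors),
    adapter_ids: which LoRA each request uses (None = base model only),
    loras: adapter_id -> (a, b) low-rank factors (scalars here)."""
    out = []
    for x, aid in zip(xs, adapter_ids):
        y = x * base_w          # base projection, shared by all requests
        if aid is not None:
            a, b = loras[aid]
            y += x * a * b      # per-request low-rank LoRA delta
        out.append(y)
    return out
```

The batched kernels in the real implementation do this gather-and-add per adapter on the GPU so requests with different adapters can still share one forward pass.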
### Describe the feature you want to add to this project https://github.com/pycaret/pycaret/pull/3170 started work on isolating properties, but the state is still not ideal. We should look into ways to...
This PR significantly lowers startup time for LoRA models by reusing the CPU dummy LoRA used for memory profiling, whose creation time is non-trivial. This doesn't impact any...
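The optimization can be sketched as simple memoization: build the zero-filled dummy LoRA once during memory profiling, then hand back the cached object at startup instead of allocating it again. All names here are hypothetical, not the actual vLLM internals.

```python
import functools

CREATE_CALLS = 0  # counts how many times the dummy LoRA is actually built

@functools.lru_cache(maxsize=None)
def get_dummy_lora(rank, hidden_size):
    """Create (once) a zero-filled low-rank pair standing in for the
    CPU dummy LoRA tensors used during memory profiling."""
    global CREATE_CALLS
    CREATE_CALLS += 1
    a = [[0.0] * rank for _ in range(hidden_size)]  # hidden_size x rank
    b = [[0.0] * hidden_size for _ in range(rank)]  # rank x hidden_size
    return a, b

get_dummy_lora(8, 16)  # built during memory profiling
get_dummy_lora(8, 16)  # reused at startup: no second allocation
```

The second call returns the cached pair, so the expensive creation happens exactly once per shape.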
This PR makes use of the new `ray_remote_args_fn` API added to Ray Data to allow for tensor parallelism when conducting batch inference with vLLM and Ray Data. FIX https://github.com/vllm-project/vllm/issues/4410 ---...
Hey folks, awesome and really impactful work on the repo and the paper. I was wondering what the reason was for switching from the original `bgmv` kernel to a CUTLASS-based...
Small optimization for CUDA graph use cases. According to profiling, this shaves off ~10% of kernel execution time for empty queries.