Antoni Baum
Hugging Face Optimum's BetterTransformer replaces model layers with its own fused versions. Those layers have multiple attributes set to None, which causes the following exception when we use AutoTP with `deepspeed.init_inference` on...
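A minimal plain-Python sketch of the failure mode and the tolerant fix. The class and function names below are illustrative stand-ins, not the actual Optimum or DeepSpeed code: the point is that sharding logic which walks layer attributes must skip the ones BetterTransformer set to None.

```python
class FusedLayer:
    """Stand-in for a BetterTransformer layer: the fused weight replaces
    the original projections, which are left as None."""
    def __init__(self):
        self.in_proj_weight = [1.0, 2.0, 3.0, 4.0]  # fused QKV weight
        self.q_proj = None  # removed in favor of the fused weight
        self.k_proj = None

def shard_attrs(layer, world_size):
    """AutoTP-style sketch: split each list attribute across ranks.
    Without the None check below, len(None) raises a TypeError."""
    shards = {}
    for name, value in vars(layer).items():
        if value is None:  # the fix: tolerate attributes set to None
            continue
        step = len(value) // world_size
        shards[name] = [value[i * step:(i + 1) * step]
                        for i in range(world_size)]
    return shards
```

With `world_size=2`, only the non-None fused weight gets sharded and the None projections are skipped instead of crashing.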
Follows best practices and ensures easier subclassing.
# What does this PR do? Improves consistency and ease of use: you can just run `make` to install vllm without any extra steps. Fixes # (issue) ## Before submitting...
This PR adds support for running multiple LoRA adapters in a single batch in a similar fashion to the S-LoRA/punica projects. WIP: - I want to clean up the code...
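A hedged pure-Python sketch of the core idea behind S-LoRA/punica-style multi-adapter batching: every request in a batch carries its own adapter id, the base projection is shared, and a per-request low-rank delta is added on top. Scalars stand in for token vectors and weight matrices; none of these names are vLLM's actual API.

```python
def apply_lora_batch(xs, adapter_ids, base_w, loras):
    """xs: per-request inputs (scalars standing in for token vectors),
    adapter_ids: which LoRA each request uses (None = base model only),
    loras: adapter_id -> (a, b) low-rank factors (scalars here)."""
    out = []
    for x, aid in zip(xs, adapter_ids):
        y = x * base_w          # base projection, shared by all requests
        if aid is not None:
            a, b = loras[aid]
            y += x * a * b      # per-request low-rank LoRA delta
        out.append(y)
    return out
```

The batched kernels in the real implementation do this gather-and-add per adapter on the GPU so requests with different adapters can still share one forward pass.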
### Describe the feature you want to add to this project https://github.com/pycaret/pycaret/pull/3170 started work on isolating properties, but the state is still not ideal. We should look into ways to...
This PR significantly lowers startup time for LoRA models by reusing the CPU dummy LoRA used for memory profiling, whose creation time is non-trivial. This doesn't impact any...
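The optimization can be sketched as simple memoization: build the zero-filled dummy LoRA once during memory profiling, then hand back the cached object at startup instead of allocating it again. All names here are hypothetical, not the actual vLLM internals.

```python
import functools

CREATE_CALLS = 0  # counts how many times the dummy LoRA is actually built

@functools.lru_cache(maxsize=None)
def get_dummy_lora(rank, hidden_size):
    """Create (once) a zero-filled low-rank pair standing in for the
    CPU dummy LoRA tensors used during memory profiling."""
    global CREATE_CALLS
    CREATE_CALLS += 1
    a = [[0.0] * rank for _ in range(hidden_size)]  # hidden_size x rank
    b = [[0.0] * hidden_size for _ in range(rank)]  # rank x hidden_size
    return a, b

get_dummy_lora(8, 16)  # built during memory profiling
get_dummy_lora(8, 16)  # reused at startup: no second allocation
```

The second call returns the cached pair, so the expensive creation happens exactly once per shape.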
This PR makes use of the new `ray_remote_args_fn` API added to Ray Data to allow for tensor parallelism when conducting batch inference with vLLM and Ray Data. FIX https://github.com/vllm-project/vllm/issues/4410 ---...
Hey folks, awesome and really impactful work on the repo and the paper. I was wondering what the reason was for switching from the original `bgmv` kernel to a CUTLASS-based...
Small optimization for CUDA graph use cases. According to profiling, this shaves off ~10% of kernel execution time for empty queries.