AlpinDale
Not fully optimized yet, since much of the sm_100 codepath is still used here. Tested with [alpindale/Ling-mini-2.0-NVFP4](https://huggingface.co/alpindale/Ling-mini-2.0-NVFP4), which gets about 91 tok/s decode (slower than the 140 tok/s with...
Groundwork for moondream3 support, which will land in a later PR.
We don't use all-gather much in general, but context parallelism relies on it heavily. This adds a fair bit of overhead when doing context parallelism, sometimes...
Adds a new pattern to the sequence parallelism pass to support activations like SiLU and GELU. This transforms "AllReduce -> Activation" into "ReduceScatter -> Activation -> AllGather", enabling further fusion...
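Why the rewrite is safe can be sketched without any GPU or distributed runtime: SiLU and GELU are elementwise, so applying them to one shard of the reduced tensor gives the same values as applying them to the full tensor and slicing afterwards. The snippet below is only an illustration under that assumption. It simulates ranks as plain Python lists; the helper names (`all_reduce`, `reduce_scatter`, `all_gather`, `baseline`, `rewritten`) are hypothetical and are not the pass's actual API, and it assumes the tensor length divides evenly by the number of ranks.

```python
import math


def silu(x):
    # SiLU(x) = x * sigmoid(x); elementwise, so it commutes with sharding.
    return x / (1.0 + math.exp(-x))


def all_reduce(partials):
    # Every simulated rank ends up with the full elementwise sum.
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


def reduce_scatter(partials):
    # Rank r ends up with only the r-th contiguous shard of the sum.
    world = len(partials)
    total = [sum(vals) for vals in zip(*partials)]
    shard = len(total) // world  # assumes len(total) % world == 0
    return [total[r * shard:(r + 1) * shard] for r in range(world)]


def all_gather(shards):
    # Every simulated rank ends up with all shards concatenated in rank order.
    full = [v for s in shards for v in s]
    return [list(full) for _ in shards]


def baseline(partials, act):
    # AllReduce -> Activation: each rank activates the whole tensor.
    return [[act(v) for v in rank] for rank in all_reduce(partials)]


def rewritten(partials, act):
    # ReduceScatter -> Activation -> AllGather: each rank activates one shard.
    shards = reduce_scatter(partials)
    activated = [[act(v) for v in s] for s in shards]
    return all_gather(activated)


# Two simulated ranks, each holding a partial result of length 4.
partials = [
    [0.5, -1.0, 2.0, 0.25],
    [1.5, 0.5, -3.0, 0.75],
]
out_a = baseline(partials, silu)
out_b = rewritten(partials, silu)
print(out_a == out_b)
```

In the rewritten form each rank runs the activation on only `1 / world_size` of the elements, and the activation sits between two collectives where it can be fused with neighbouring ops, which is the motivation stated in the description above.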
Still a WIP. Need to build triton from source.

```sh
$ apt install zlib1g-dev
$ git clone https://github.com/triton-lang/triton.git && cd triton
$ uv pip install -r python/requirements.txt
$ uv pip...
```
Really slow at the moment, will investigate. V1 only. Launch with `APHRODITE_ATTENTION_BACKEND=SAGE_ATTN`