Luka Govedič
> to make it work for v1, maybe we can stick to the full-graph approach, then we can have this fusion optimization together with cudagraph.

Yeah, I think that's one...
I agree that would be too tricky, but I'm thinking we put the quant nodes (there are just 1 or 2) into the split item. So we just let the custom...
`VLLM_USE_V1=0 ... -O '{"pass_config":{"enable_attn_fusion": false}}'`:

```console
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.74|±  |0.0441|
|     |       |strict-match    |     5|exact_match|↑  | 0.70|±  |0.0461|
```

`VLLM_USE_V1=0...`
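The table above is in the console-output format of lm-evaluation-harness; for anyone trying to reproduce, an invocation along these lines would produce a table like it (this is a sketch, not the exact command from this run — the model name and sample limit are placeholders):

```shell
# Hypothetical reproduction sketch: <model> and --limit are placeholders,
# not values taken from this PR's benchmark run.
lm_eval --model vllm \
  --model_args pretrained=<model> \
  --tasks gsm8k --num_fewshot 5 --limit 100
```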
@gemini-code-assist review
Perf results below. Decode performance (ITL) improves substantially (2-10%), while prefill regresses. I will investigate prefill after this PR. ### 📊 ITL Median (ms) | Source...
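For context, ITL here is inter-token latency: the gap between consecutive decode tokens. A minimal sketch of how a median ITL could be computed from per-token arrival times (the timestamps below are made-up illustrative data, not from this benchmark):

```python
import statistics

def median_itl_ms(token_timestamps_s):
    """Median inter-token latency in ms, given per-token arrival times in seconds."""
    itls = [later - earlier
            for earlier, later in zip(token_timestamps_s, token_timestamps_s[1:])]
    return statistics.median(itls) * 1000

# Made-up arrival times for 5 decode tokens (seconds).
timestamps = [0.000, 0.012, 0.025, 0.036, 0.050]
print(round(median_itl_ms(timestamps), 1))  # → 12.5
```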
This should likely be done as part of the refactoring mentioned in #11785, which uses the `ScaledMMKernel` abstraction for FP8 kernels.
Yes! I actually started some work on this you might find useful in #19434. Also take a look at #8913 to understand the broader goal of the refactor.
@shivampr please go ahead! And let me know if you need any help
I usually develop on bare metal, but the rocm/vllm-dev container should work too. Can you create issues for the problems you're running into? You can also post in #sig-amd in the...