Luka Govedič
> to make it work for v1, maybe we can stick to the full-graph approach, then we can have this fusion optimization together with cudagraph.

Yeah, I think that's one...
I agree that would be too tricky, but I'm thinking we put the quant nodes (there are just 1 or 2) into the split item. So we just let the custom...
`VLLM_USE_V1=0 ... -O '{"pass_config":{"enable_attn_fusion": false}}'`:

```console
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.74|±  |0.0441|
|     |       |strict-match    |     5|exact_match|↑  | 0.70|±  |0.0461|
```

`VLLM_USE_V1=0...`
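The table above is in the console-output format of lm-evaluation-harness; for anyone trying to reproduce, an invocation along these lines would produce a table like it (this is a sketch, not the exact command from this run — the model name and sample limit are placeholders):

```shell
# Hypothetical reproduction sketch: <model> and --limit are placeholders,
# not values taken from this PR's benchmark run.
lm_eval --model vllm \
  --model_args pretrained=<model> \
  --tasks gsm8k --num_fewshot 5 --limit 100
```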
@gemini-code-assist review
Perf results below. Decode performance (ITL) improves substantially (2-10%), while prefill regresses. I will investigate prefill after this PR. ### 📊 ITL Median (ms) | Source...
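For context, ITL here is inter-token latency: the gap between consecutive decode tokens. A minimal sketch of how a median ITL could be computed from per-token arrival times (the timestamps below are made-up illustrative data, not from this benchmark):

```python
import statistics

def median_itl_ms(token_timestamps_s):
    """Median inter-token latency in ms, given per-token arrival times in seconds."""
    itls = [later - earlier
            for earlier, later in zip(token_timestamps_s, token_timestamps_s[1:])]
    return statistics.median(itls) * 1000

# Made-up arrival times for 5 decode tokens (seconds).
timestamps = [0.000, 0.012, 0.025, 0.036, 0.050]
print(round(median_itl_ms(timestamps), 1))  # → 12.5
```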
This should likely be done as part of the refactoring mentioned in #11785, which uses the `ScaledMMKernel` abstraction for FP8 kernels.
Yes! I actually started some work on this you might find useful in #19434. Also take a look at #8913 to understand the broader goal of the refactor.
@shivampr please go ahead! And let me know if you need any help
I usually develop on bare metal, but the rocm/vllm-dev container should work too. Can you create issues for the problems you're running into? You can also post in #sig-amd in the...