Luka Govedič

Results: 93 comments of Luka Govedič

@chanh +1 - it seems like the test was never added to CI (needs to be added manually to `.buildkite/test-pipeline.yml`). When I run the test locally, the first shape works...

Also @angelayi, I just noticed there are no e2e tests - could you make the existing E2E tests use no custom ops by default (tests/distributed/test_sequence_parallelism.py or something like that) as well as...
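
A hedged sketch of what that default could look like, assuming vLLM's `compilation_config` accepts a `custom_ops` list where `"none"` disables all custom ops (the test name and model here are illustrative, not the actual test):

```python
# Hypothetical sketch: parametrize the E2E test so custom ops are off by
# default, with one opt-in case that enables them. Assumes
# compilation_config accepts {"custom_ops": [...]} where "none" disables
# all custom ops and "all" enables them.
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("custom_ops", [["none"], ["all"]])
def test_sequence_parallelism_e2e(custom_ops):
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        tensor_parallel_size=2,
        compilation_config={"custom_ops": custom_ops},
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    assert out[0].outputs[0].text
```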

@angelayi it seems like a similar failure occurs in the distributed tests CI?

@gemini-code-assist review

No problem, thanks for letting me know! This is a draft so there's no rush; I'll rebase at some point when I'm back from vacation.

@cyang49 I've addressed all of your comments, could you take a final look? I also added the `Epilogues.md` doc with extended descriptions and inverted the sign of azp to be...
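
For context, azp is the asymmetric zero point used in INT8 quantization. A minimal numpy sketch of the convention (illustrative only, not the actual epilogue kernels): with `q = round(x / scale) - azp`, dequantization adds the zero point back, so inverting the sign of azp decides whether the epilogue adds or subtracts it:

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray):
    """Per-tensor asymmetric INT8 quantization (illustrative only)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    # Zero point chosen so that x.min() maps to qmin.
    azp = np.round(x.min() / scale).astype(np.int32) - qmin
    q = np.clip(np.round(x / scale) - azp, qmin, qmax).astype(np.int8)
    return q, scale, azp

def dequantize(q: np.ndarray, scale: float, azp: int) -> np.ndarray:
    # With this sign convention the epilogue *adds* azp back; flipping
    # the sign of azp turns this into a subtraction instead.
    return (q.astype(np.float32) + azp) * scale

x = np.linspace(-1.0, 3.0, 9, dtype=np.float32)
q, scale, azp = quantize_asymmetric(x)
print(np.abs(dequantize(q, scale, azp) - x).max())  # about scale / 2
```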

Tested `LLaMa-3.1-8B-FP8` locally across combinations of cutlass/non-cutlass, V0/V1, and eager/cuda-graph/compiled; all work ✅

I think this is a great idea! And if we're concerned with lack of visibility into a cache miss, we can improve that separately (e.g. storing config in the cache...
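
One way such visibility could work (a hypothetical sketch, not the actual vLLM cache code): key each cache entry by a hash of the canonicalized config and store the config next to the artifact, so a miss can be explained by diffing against previously cached configs:

```python
# Hypothetical sketch: hash-keyed cache entries with stored configs so a
# miss can report which config keys differ from earlier entries.
import hashlib
import json
from pathlib import Path

def cache_key(config: dict) -> str:
    # Canonical JSON so semantically equal configs hash identically.
    return hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:16]

def lookup(cache_dir: Path, config: dict) -> Path | None:
    entry = cache_dir / cache_key(config)
    if entry.is_dir():
        return entry  # hit
    # Miss: report which keys differ from each previously stored config.
    for meta in cache_dir.glob("*/config.json"):
        stored = json.loads(meta.read_text())
        diff = {k for k in config if stored.get(k) != config[k]}
        print(f"cache miss vs {meta.parent.name}: differing keys {sorted(diff)}")
    return None
```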

**Triton compile issue resolved**

The code is currently failing with a Triton compilation error (weird):

```
loc("/home/luka/git/vllm/vllm/attention/ops/triton_flash_attention.py":863:57): error: operand #1 does not dominate this use
```

The offending [line](https://github.com/vllm-project/vllm/blob/d6b46c4eacb7c128c4f2f897c2d46d267f71cffb/vllm/attention/ops/triton_flash_attention.py#L863): ...

**Memory issue resolved**

## Triton memory issue

Repro steps:

```
VLLM_USE_V1=0 python examples/offline_inference/basic/generate.py --compilation-config="{'debug_dump_path':'debug-amd','level':3,'pass_config':{'enable_attn_fusion':True}}" --model amd/Llama-3.1-8B-Instruct-FP8-KV --kv-cache-dtype fp8
```

Works without attention fusion:

```
VLLM_USE_V1=0 python examples/offline_inference/basic/generate.py --compilation-config="{'debug_dump_path':'debug-amd','level':3,'pass_config':{'enable_attn_fusion':False}}" --model amd/Llama-3.1-8B-Instruct-FP8-KV...
```
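
For reference, a rough offline-API equivalent of the CLI repro above (a sketch; it assumes `CompilationConfig` and `PassConfig` are importable as shown and mirror the fields in the `--compilation-config` flag):

```python
# Sketch of the repro via the offline API instead of the CLI.
import os
os.environ["VLLM_USE_V1"] = "0"  # match the V0 repro above

from vllm import LLM
from vllm.config import CompilationConfig, PassConfig

llm = LLM(
    model="amd/Llama-3.1-8B-Instruct-FP8-KV",
    kv_cache_dtype="fp8",
    compilation_config=CompilationConfig(
        level=3,
        debug_dump_path="debug-amd",
        # Flip to False to compare against the working configuration.
        pass_config=PassConfig(enable_attn_fusion=True),
    ),
)
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```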