Luka Govedič

Results: 93 comments of Luka Govedič

@chanh +1 - it seems like the test was never added to CI (needs to be added manually to `.buildkite/test-pipeline.yml`). When I run the test locally, the first shape works...

Also @angelayi, I just noticed there are no e2e tests - could you make the existing E2E tests use no custom ops by default (tests/distributed/test_sequence_parallelism.py or something like that) as well as...
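
A hedged sketch of what that default could look like, assuming vLLM's `compilation_config` accepts a `custom_ops` list where `"none"` disables all custom ops (the test name and model here are illustrative, not the actual test):

```python
# Hypothetical sketch: parametrize the E2E test so custom ops are off by
# default, with one opt-in case that enables them. Assumes
# compilation_config accepts {"custom_ops": [...]} where "none" disables
# all custom ops and "all" enables them.
import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("custom_ops", [["none"], ["all"]])
def test_sequence_parallelism_e2e(custom_ops):
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        tensor_parallel_size=2,
        compilation_config={"custom_ops": custom_ops},
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    assert out[0].outputs[0].text
```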

@angelayi it seems like a similar failure occurs in the distributed tests CI?

@gemini-code-assist review

No problem, thanks for letting me know! This is a draft so there's no rush; I'll rebase at some point when I'm back from vacation.

@cyang49 I've addressed all of your comments, could you take a final look? I also added the `Epilogues.md` doc with extended descriptions and inverted the sign of azp to be...
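
For context, azp is the asymmetric zero point used in INT8 quantization. A minimal numpy sketch of the convention (illustrative only, not the actual epilogue kernels): with `q = round(x / scale) - azp`, dequantization adds the zero point back, so inverting the sign of azp decides whether the epilogue adds or subtracts it:

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray):
    """Per-tensor asymmetric INT8 quantization (illustrative only)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    # Zero point chosen so that x.min() maps to qmin.
    azp = np.round(x.min() / scale).astype(np.int32) - qmin
    q = np.clip(np.round(x / scale) - azp, qmin, qmax).astype(np.int8)
    return q, scale, azp

def dequantize(q: np.ndarray, scale: float, azp: int) -> np.ndarray:
    # With this sign convention the epilogue *adds* azp back; flipping
    # the sign of azp turns this into a subtraction instead.
    return (q.astype(np.float32) + azp) * scale

x = np.linspace(-1.0, 3.0, 9, dtype=np.float32)
q, scale, azp = quantize_asymmetric(x)
print(np.abs(dequantize(q, scale, azp) - x).max())  # about scale / 2
```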

Tested `LLaMa-3.1-8B-FP8` locally across combinations of cutlass/non-cutlass, V0/V1, and eager/cuda-graph/compiled; all work ✅

I think this is a great idea! And if we're concerned with lack of visibility into a cache miss, we can improve that separately (e.g. storing config in the cache...
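
One way such visibility could work (a hypothetical sketch, not the actual vLLM cache code): key each cache entry by a hash of the canonicalized config and store the config next to the artifact, so a miss can be explained by diffing against previously cached configs:

```python
# Hypothetical sketch: hash-keyed cache entries with stored configs so a
# miss can report which config keys differ from earlier entries.
import hashlib
import json
from pathlib import Path

def cache_key(config: dict) -> str:
    # Canonical JSON so semantically equal configs hash identically.
    return hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:16]

def lookup(cache_dir: Path, config: dict) -> Path | None:
    entry = cache_dir / cache_key(config)
    if entry.is_dir():
        return entry  # hit
    # Miss: report which keys differ from each previously stored config.
    for meta in cache_dir.glob("*/config.json"):
        stored = json.loads(meta.read_text())
        diff = {k for k in config if stored.get(k) != config[k]}
        print(f"cache miss vs {meta.parent.name}: differing keys {sorted(diff)}")
    return None
```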

**Triton compile issue resolved**

The code is currently failing with a Triton compilation error (weird):

```
loc("/home/luka/git/vllm/vllm/attention/ops/triton_flash_attention.py":863:57): error: operand #1 does not dominate this use
```

The offending [line](https://github.com/vllm-project/vllm/blob/d6b46c4eacb7c128c4f2f897c2d46d267f71cffb/vllm/attention/ops/triton_flash_attention.py#L863): ...

**Memory issue resolved**

## Triton memory issue

Repro steps:

```
VLLM_USE_V1=0 python examples/offline_inference/basic/generate.py --compilation-config="{'debug_dump_path':'debug-amd','level':3,'pass_config':{'enable_attn_fusion':True}}" --model amd/Llama-3.1-8B-Instruct-FP8-KV --kv-cache-dtype fp8
```

Works without attention fusion:

```
VLLM_USE_V1=0 python examples/offline_inference/basic/generate.py --compilation-config="{'debug_dump_path':'debug-amd','level':3,'pass_config':{'enable_attn_fusion':False}}" --model amd/Llama-3.1-8B-Instruct-FP8-KV...
```
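
For reference, a rough offline-API equivalent of the CLI repro above (a sketch; it assumes `CompilationConfig` and `PassConfig` are importable as shown and mirror the fields in the `--compilation-config` flag):

```python
# Sketch of the repro via the offline API instead of the CLI.
import os
os.environ["VLLM_USE_V1"] = "0"  # match the V0 repro above

from vllm import LLM
from vllm.config import CompilationConfig, PassConfig

llm = LLM(
    model="amd/Llama-3.1-8B-Instruct-FP8-KV",
    kv_cache_dtype="fp8",
    compilation_config=CompilationConfig(
        level=3,
        debug_dump_path="debug-amd",
        # Flip to False to compare against the working configuration.
        pass_config=PassConfig(enable_attn_fusion=True),
    ),
)
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```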