Luka Govedič

Results 93 comments of Luka Govedič

> Could you give some details on speedup associated with this modification?

I haven't necessarily profiled this but it's meant to enable the double-batch-overlap optimization (prototype in #18415)

> Hi, any further progress on this pr?

Almost ready for review!

Currently experiencing some issues when batching (in unit test), need to investigate further.

I was able to resolve the issue, and I never encountered illegal memory accesses, just bad outputs. What setup were you using that led to the error, and can you...

## lm-eval:

### `full_cuda_graph=True`:

```
local-completions (pretrained=deepseek-ai/DeepSeek-V2-Lite,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,max_retries=3), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.29|±  |0.0456|
|...
```

Benchmarking results below. There's an ITL improvement, especially at low QPS, and a major hit to TTFT because CUDA Graphs are disabled for prefill.

Model: `deepseek-ai/DeepSeek-V2-Lite`

### 📊 ITL Median...

@LucasWilkinson and I spoke about this: summary is that I'll use this opportunity to slightly improve the metadata building process:
- add other common params to `CommonAttentionMetadata`
- create a...

Added:
- @ywang96's note about LMM profile_run
- @aarnphm's autotuning caching issue
- @lionelvillard's lazy CUDAGraph capture proposal

> Regarding measurements you might find https://github.com/vllm-project/vllm/issues/19318 useful. Looking forward to the PR!

@mgoin I don't think I have permission to create a project, could you or @simon-mo create one?