Chanh Nguyen

Results 5 comments of Chanh Nguyen

> @chanh thanks for the PR. I tested Llama 8B on my side with your PR and see a ~7% improvement in TPOT. Great work!
>
> Before PR:...

> Actually, what are `max_seq_len` and `max_query_len` at capture time? If either is 0, I guess this could cause a bug.

`max_query_len` is the `num_tokens` being passed to `_dummy_run`, which is...
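The capture-time behavior described above can be sketched as follows. This is a minimal illustration with hypothetical names (`CaptureMetadata`, `build_dummy_metadata`); the real fields live in vLLM's attention-backend metadata builders and differ in detail:

```python
from dataclasses import dataclass


# Hypothetical stand-in for the attention metadata built during a dummy run.
@dataclass
class CaptureMetadata:
    max_query_len: int
    max_seq_len: int


def build_dummy_metadata(num_tokens: int, max_model_len: int) -> CaptureMetadata:
    """Build metadata for a dummy run at cuda-graph capture time.

    There are no real requests during capture, so max_query_len falls back
    to the num_tokens passed to the dummy run. Leaving either field at 0
    could trigger the bug discussed above, since kernels may size internal
    work from these values.
    """
    assert num_tokens > 0 and max_model_len > 0, "a value of 0 here would be a bug"
    return CaptureMetadata(max_query_len=num_tokens, max_seq_len=max_model_len)
```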

> I think we may need to disable ahead-of-time scheduling for FA3 when using full cuda-graph:
>
> https://github.com/vllm-project/vllm/blob/1a6af1453d2077832c3d5e8bcd60a5ef6a95e46b/vllm/v1/attention/backends/flash_attn.py#L341-L354
>
> since this scheduler may choose a different number of...

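The guard suggested in the thread above could look roughly like the sketch below. The function name and parameters are hypothetical, not vLLM's actual API; the point is only the condition itself:

```python
def allow_aot_scheduling(backend_name: str, full_cuda_graph: bool) -> bool:
    """Decide whether FA3's ahead-of-time (AOT) scheduler may run.

    Under full cuda-graph, the AOT scheduler could choose a different
    number of splits at replay time than at capture time, invalidating
    the captured graph, so it is disabled in that mode.
    """
    if backend_name != "FA3":
        return False  # AOT scheduling only applies to FA3 in this sketch
    return not full_cuda_graph
```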