bobbych94

16 comments of bobbych94

> > Can you discuss why this is the case? If possible, I would really appreciate it if we could get the first iteration working with CUDA graph. > > @simon-mo The...

> @robertgshaw2-neuralmagic DCA requires three queries (`query`, `query_succ`, `query_inter`) produced by the rotary embedding layer. They cannot be computed in `Attention`. A possible solution is to stack the three queries into one, which would change the shape of `query`. Would that be acceptable? @WoosukKwon do you have any ideas?
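
For illustration, a minimal PyTorch sketch of the stacking idea above. The function names, tensor shapes, and the stacking dimension are assumptions for the sake of the example, not the actual vLLM DCA code:

```python
import torch

# Hypothetical illustration of "stack three queries into one".
# Shapes are assumed to be [num_tokens, num_heads * head_size]; this is
# not the real vLLM interface.

def pack_queries(query: torch.Tensor,
                 query_succ: torch.Tensor,
                 query_inter: torch.Tensor) -> torch.Tensor:
    """Stack the three DCA queries along a new leading dimension."""
    return torch.stack([query, query_succ, query_inter], dim=0)

def unpack_queries(packed: torch.Tensor):
    """Recover the three queries inside the attention backend."""
    query, query_succ, query_inter = packed.unbind(dim=0)
    return query, query_succ, query_inter

num_tokens, hidden = 4, 128
q = torch.randn(num_tokens, hidden)
q_succ = torch.randn(num_tokens, hidden)
q_inter = torch.randn(num_tokens, hidden)

packed = pack_queries(q, q_succ, q_inter)       # shape [3, num_tokens, hidden]
q2, q_succ2, q_inter2 = unpack_queries(packed)  # original shapes restored
```

Packing this way lets a single tensor flow into `Attention`, which is exactly the shape change the comment is asking about.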

> @nanmi The functions [`_bruteforce_dynamic_chunk_flash_attn_varlen_func`](https://github.com/hzhwcmhf/vllm/blob/7653ec3067bfb4782e99121520603409ef739725/vllm/attention/backends/dual_chunk_flash_attn.py#L626) for prefill and [`_bruteforce_dynamic_chunk_pageattention_forward_decode`](https://github.com/hzhwcmhf/vllm/blob/7653ec3067bfb4782e99121520603409ef739725/vllm/attention/backends/dual_chunk_flash_attn.py#L713) for decoding could indeed be optimized through CUDA kernel implementations similar to Flash Attention. > > Taking `_bruteforce_dynamic_chunk_flash_attn_varlen_func` as an example,...
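
As background for why a fused kernel helps here, the sketch below (assumed, not the vLLM implementation) shows the exact log-sum-exp merge that combines attention results computed over separate key chunks; a Flash-Attention-style CUDA kernel performs this merge on the fly instead of materializing partial outputs in Python:

```python
import torch

def attn_partial(q, k, v, scale):
    """Naive attention over one key/value chunk; returns output and per-row LSE."""
    scores = (q @ k.transpose(-1, -2)) * scale   # [q_len, k_len]
    lse = torch.logsumexp(scores, dim=-1)        # [q_len]
    out = torch.softmax(scores, dim=-1) @ v      # [q_len, head_dim]
    return out, lse

def merge_partials(parts):
    """Combine (out, lse) pairs from several chunks into exact full attention."""
    outs = torch.stack([p[0] for p in parts])    # [n_chunks, q_len, head_dim]
    lses = torch.stack([p[1] for p in parts])    # [n_chunks, q_len]
    lse_total = torch.logsumexp(lses, dim=0)     # [q_len]
    weights = torch.exp(lses - lse_total).unsqueeze(-1)
    return (weights * outs).sum(dim=0)

q = torch.randn(4, 64)
k = torch.randn(32, 64)
v = torch.randn(32, 64)
scale = 64 ** -0.5

# Attend per 8-key chunk, then merge; the result equals attention over all keys.
chunks = [(k[i:i + 8], v[i:i + 8]) for i in range(0, 32, 8)]
parts = [attn_partial(q, kc, vc, scale) for kc, vc in chunks]
merged = merge_partials(parts)

full, _ = attn_partial(q, k, v, scale)
assert torch.allclose(merged, full, atol=1e-5)
```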

> Hi [@arcadia-ai](https://github.com/arcadia-ai)! You can specify the browser location; take a look at this example: > > from pydoll.browser.chrome import Chrome > from pydoll.browser.options import Options > > async def...
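
A hedged completion of the truncated example above, assuming the pydoll API shown in the quoted snippet (`Chrome`, `Options`) plus an `Options.binary_location` attribute and `get_page`/`go_to` methods; treat the attribute/method names and the browser path as assumptions that may differ across pydoll versions:

```python
import asyncio

from pydoll.browser.chrome import Chrome
from pydoll.browser.options import Options


async def main():
    options = Options()
    # Assumed attribute for pointing pydoll at a specific browser binary;
    # the path below is a placeholder.
    options.binary_location = '/usr/bin/google-chrome'

    # Method names follow the quoted example and may differ in newer releases.
    browser = Chrome(options=options)
    await browser.start()
    page = await browser.get_page()
    await page.go_to('https://example.com')
    await browser.stop()


asyncio.run(main())
```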

> [@nanmi](https://github.com/nanmi) Could you set `NCCL_DEBUG=INFO` to obtain more debugging information before NCCL crashes and paste it here? This is my running configuration: I use H20 96G x 8, with `extra-llm-api-config-deepseek_h20.yml`: ```yaml...
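
For reference, NCCL only picks up `NCCL_DEBUG` if it is set in the environment of the process that initializes NCCL, so it has to be exported before the server is launched; a minimal sketch (the subsystem filter is optional and the values here are illustrative):

```python
import os

# Set (or export in the launching shell, e.g. `NCCL_DEBUG=INFO <launch command>`)
# before NCCL is initialized.
os.environ["NCCL_DEBUG"] = "INFO"
# Optional: limit the log volume to the init and network subsystems.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
```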