bobbych94

16 comments of bobbych94

> > Can you discuss why this is the case? If possible, I would really appreciate it if we could get the first iteration working with CUDA graph. > > @simon-mo The...

> @robertgshaw2-neuralmagic DCA requires three queries (`query`, `query_succ`, `query_inter`) produced by the rotary embedding layer. They cannot be computed in `Attention`. A possible solution is to stack the three queries into one, which would change the shape of `query`. Would that be acceptable? @WoosukKwon do you have any ideas?
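
For illustration, a minimal PyTorch sketch of the stacking idea above. The function names, tensor shapes, and the stacking dimension are assumptions for the sake of the example, not the actual vLLM DCA code:

```python
import torch

# Hypothetical illustration of "stack three queries into one".
# Shapes are assumed to be [num_tokens, num_heads * head_size]; this is
# not the real vLLM interface.

def pack_queries(query: torch.Tensor,
                 query_succ: torch.Tensor,
                 query_inter: torch.Tensor) -> torch.Tensor:
    """Stack the three DCA queries along a new leading dimension."""
    return torch.stack([query, query_succ, query_inter], dim=0)

def unpack_queries(packed: torch.Tensor):
    """Recover the three queries inside the attention backend."""
    query, query_succ, query_inter = packed.unbind(dim=0)
    return query, query_succ, query_inter

num_tokens, hidden = 4, 128
q = torch.randn(num_tokens, hidden)
q_succ = torch.randn(num_tokens, hidden)
q_inter = torch.randn(num_tokens, hidden)

packed = pack_queries(q, q_succ, q_inter)       # shape [3, num_tokens, hidden]
q2, q_succ2, q_inter2 = unpack_queries(packed)  # original shapes restored
```

Packing this way lets a single tensor flow into `Attention`, which is exactly the shape change the comment is asking about.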

> @nanmi The functions [`_bruteforce_dynamic_chunk_flash_attn_varlen_func`](https://github.com/hzhwcmhf/vllm/blob/7653ec3067bfb4782e99121520603409ef739725/vllm/attention/backends/dual_chunk_flash_attn.py#L626) for prefill and [`_bruteforce_dynamic_chunk_pageattention_forward_decode`](https://github.com/hzhwcmhf/vllm/blob/7653ec3067bfb4782e99121520603409ef739725/vllm/attention/backends/dual_chunk_flash_attn.py#L713) for decoding could indeed be optimized through CUDA kernel implementations similar to Flash Attention. > > Taking `_bruteforce_dynamic_chunk_flash_attn_varlen_func` as an example,...
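
As background for why a fused kernel helps here, the sketch below (assumed, not the vLLM implementation) shows the exact log-sum-exp merge that combines attention results computed over separate key chunks; a Flash-Attention-style CUDA kernel performs this merge on the fly instead of materializing partial outputs in Python:

```python
import torch

def attn_partial(q, k, v, scale):
    """Naive attention over one key/value chunk; returns output and per-row LSE."""
    scores = (q @ k.transpose(-1, -2)) * scale   # [q_len, k_len]
    lse = torch.logsumexp(scores, dim=-1)        # [q_len]
    out = torch.softmax(scores, dim=-1) @ v      # [q_len, head_dim]
    return out, lse

def merge_partials(parts):
    """Combine (out, lse) pairs from several chunks into exact full attention."""
    outs = torch.stack([p[0] for p in parts])    # [n_chunks, q_len, head_dim]
    lses = torch.stack([p[1] for p in parts])    # [n_chunks, q_len]
    lse_total = torch.logsumexp(lses, dim=0)     # [q_len]
    weights = torch.exp(lses - lse_total).unsqueeze(-1)
    return (weights * outs).sum(dim=0)

q = torch.randn(4, 64)
k = torch.randn(32, 64)
v = torch.randn(32, 64)
scale = 64 ** -0.5

# Attend per 8-key chunk, then merge; the result equals attention over all keys.
chunks = [(k[i:i + 8], v[i:i + 8]) for i in range(0, 32, 8)]
parts = [attn_partial(q, kc, vc, scale) for kc, vc in chunks]
merged = merge_partials(parts)

full, _ = attn_partial(q, k, v, scale)
assert torch.allclose(merged, full, atol=1e-5)
```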

> Hi [@arcadia-ai](https://github.com/arcadia-ai)! You can specify the browser location; take a look at this example: > > from pydoll.browser.chrome import Chrome > from pydoll.browser.options import Options > > async def...
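
A hedged completion of the truncated example above, assuming the pydoll API shown in the quoted snippet (`Chrome`, `Options`) plus an `Options.binary_location` attribute and `get_page`/`go_to` methods; treat the attribute/method names and the browser path as assumptions that may differ across pydoll versions:

```python
import asyncio

from pydoll.browser.chrome import Chrome
from pydoll.browser.options import Options


async def main():
    options = Options()
    # Assumed attribute for pointing pydoll at a specific browser binary;
    # the path below is a placeholder.
    options.binary_location = '/usr/bin/google-chrome'

    # Method names follow the quoted example and may differ in newer releases.
    browser = Chrome(options=options)
    await browser.start()
    page = await browser.get_page()
    await page.go_to('https://example.com')
    await browser.stop()


asyncio.run(main())
```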

> [@nanmi](https://github.com/nanmi) Could you set `NCCL_DEBUG=INFO` to obtain more debugging information before NCCL crashes and paste it here? This is my running configuration: I use H20 96G x 8, with `extra-llm-api-config-deepseek_h20.yml`: ```yaml...
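
For reference, NCCL only picks up `NCCL_DEBUG` if it is set in the environment of the process that initializes NCCL, so it has to be exported before the server is launched; a minimal sketch (the subsystem filter is optional and the values here are illustrative):

```python
import os

# Set (or export in the launching shell, e.g. `NCCL_DEBUG=INFO <launch command>`)
# before NCCL is initialized.
os.environ["NCCL_DEBUG"] = "INFO"
# Optional: limit the log volume to the init and network subsystems.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
```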