Cunxiao Du


Thanks for your reply! However, in my test case with grouped-query attention, the gradients of k and v do not pass an allclose check when comparing the plain PyTorch implementation against the fused attention kernel.
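
A minimal sketch of the kind of check described above, assuming `flash_attn_func` from the flash-attn package as the fused kernel and a plain PyTorch reference that expands the KV heads for grouped-query attention; shapes, dtypes, and tolerances here are illustrative, not the original test case.

```python
import torch
from flash_attn import flash_attn_func

torch.manual_seed(0)
B, S, Hq, Hkv, D = 2, 128, 8, 2, 64  # 8 query heads sharing 2 KV heads (GQA)
dtype = torch.float16

q = torch.randn(B, S, Hq, D, device="cuda", dtype=dtype, requires_grad=True)
k = torch.randn(B, S, Hkv, D, device="cuda", dtype=dtype, requires_grad=True)
v = torch.randn(B, S, Hkv, D, device="cuda", dtype=dtype, requires_grad=True)

# Reference: expand KV heads so each group of query heads sees its shared KV head.
k_ref = k.repeat_interleave(Hq // Hkv, dim=2)
v_ref = v.repeat_interleave(Hq // Hkv, dim=2)
scores = torch.einsum("bshd,bthd->bhst", q.float(), k_ref.float()) / D ** 0.5
out_ref = torch.einsum("bhst,bthd->bshd", scores.softmax(dim=-1), v_ref.float())

# Fused kernel handles GQA directly when Hkv divides Hq.
out_fused = flash_attn_func(q, k, v, causal=False)

grad_out = torch.randn_like(out_fused)
dq_ref, dk_ref, dv_ref = torch.autograd.grad(out_ref, (q, k, v), grad_out.float())
dq, dk, dv = torch.autograd.grad(out_fused, (q, k, v), grad_out)

for name, a, b in [("dq", dq, dq_ref), ("dk", dk, dk_ref), ("dv", dv, dv_ref)]:
    print(name, torch.allclose(a.float(), b, atol=1e-2, rtol=1e-2))
```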

flash_attn_with_kv_cache will update the KV cache automatically, so I'm closing the issue.
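
For reference, a minimal decode-step sketch, assuming the flash-attn package: `flash_attn_with_kv_cache` writes the new k/v into `k_cache` / `v_cache` in place at the positions given by `cache_seqlens`, which is the "automatic update" referred to above; shapes and the `causal` flag are illustrative.

```python
import torch
from flash_attn import flash_attn_with_kv_cache

B, Hq, Hkv, D, max_len = 2, 8, 2, 64, 256
dtype = torch.float16

k_cache = torch.zeros(B, max_len, Hkv, D, device="cuda", dtype=dtype)
v_cache = torch.zeros(B, max_len, Hkv, D, device="cuda", dtype=dtype)
cache_seqlens = torch.zeros(B, dtype=torch.int32, device="cuda")  # tokens already cached

# One decode step: a single new token per sequence.
q = torch.randn(B, 1, Hq, D, device="cuda", dtype=dtype)
k = torch.randn(B, 1, Hkv, D, device="cuda", dtype=dtype)
v = torch.randn(B, 1, Hkv, D, device="cuda", dtype=dtype)

out = flash_attn_with_kv_cache(
    q, k_cache, v_cache, k=k, v=v, cache_seqlens=cache_seqlens, causal=True
)
cache_seqlens += 1  # the caches were updated in place; the caller advances the lengths
```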