Comments by Cunxiao Du
same issue here
Thanks for your reply! However, in my test case with grouped-query attention, the gradients of k and v from the fused attention kernel do not pass allclose against the torch implementation.
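For reference, a minimal sketch of the kind of comparison I mean (the shapes, dtype, and tolerances here are illustrative assumptions, not my exact repro script):

```python
# GQA gradient-comparison sketch: fused flash_attn_func vs. a plain torch reference.
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

torch.manual_seed(0)
batch, seqlen, n_heads, n_kv_heads, head_dim = 2, 128, 8, 2, 64
dtype, device = torch.float16, "cuda"

q = torch.randn(batch, seqlen, n_heads, head_dim, dtype=dtype, device=device, requires_grad=True)
k = torch.randn(batch, seqlen, n_kv_heads, head_dim, dtype=dtype, device=device, requires_grad=True)
v = torch.randn(batch, seqlen, n_kv_heads, head_dim, dtype=dtype, device=device, requires_grad=True)

# Fused kernel: flash_attn_func handles GQA when n_heads is a multiple of n_kv_heads.
out_fused = flash_attn_func(q, k, v, causal=True)
out_fused.sum().backward()
kg_fused, vg_fused = k.grad.clone(), v.grad.clone()
q.grad = k.grad = v.grad = None

# Torch reference: repeat the KV heads so standard attention sees a plain MHA problem.
rep = n_heads // n_kv_heads
k_ref = k.repeat_interleave(rep, dim=2)
v_ref = v.repeat_interleave(rep, dim=2)
out_ref = F.scaled_dot_product_attention(
    q.transpose(1, 2), k_ref.transpose(1, 2), v_ref.transpose(1, 2), is_causal=True
).transpose(1, 2)
out_ref.sum().backward()

# These are the checks that fail for me on the k/v gradients.
print(torch.allclose(kg_fused, k.grad, atol=1e-2, rtol=1e-2))
print(torch.allclose(vg_fused, v.grad, atol=1e-2, rtol=1e-2))
```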
flash_attn_with_kv_cache updates the KV cache automatically, so I'm closing the issue.
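For anyone landing here later, a minimal decode-step sketch of what I mean by the automatic update (the shapes and the cache_seqlens handling are my assumptions; check the flash_attn docstring for the full argument list):

```python
# flash_attn_with_kv_cache writes the new k/v into the cache in place during decoding.
import torch
from flash_attn import flash_attn_with_kv_cache

batch, max_seqlen, n_heads, n_kv_heads, head_dim = 2, 256, 8, 2, 64
dtype, device = torch.float16, "cuda"

k_cache = torch.zeros(batch, max_seqlen, n_kv_heads, head_dim, dtype=dtype, device=device)
v_cache = torch.zeros(batch, max_seqlen, n_kv_heads, head_dim, dtype=dtype, device=device)
cache_seqlens = torch.full((batch,), 10, dtype=torch.int32, device=device)  # tokens already cached

q_new = torch.randn(batch, 1, n_heads, head_dim, dtype=dtype, device=device)
k_new = torch.randn(batch, 1, n_kv_heads, head_dim, dtype=dtype, device=device)
v_new = torch.randn(batch, 1, n_kv_heads, head_dim, dtype=dtype, device=device)

# The kernel appends k_new/v_new at position cache_seqlens and attends over the cache,
# so no manual concatenation into the cache is needed.
out = flash_attn_with_kv_cache(
    q_new, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, causal=True,
)
print(out.shape)                       # (batch, 1, n_heads, head_dim)
print(k_cache[0, 10].abs().sum() > 0)  # slot 10 now holds the newly written key
```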