Amanda-Barbara

22 issues

Hello, when I use the thd qkv_format with context parallelism (CP) to run a 128*1024 sequence length on 2x A800 GPUs with the run_fused_attn_with_cp.py script, and compare the result against a plain flash-attn2 run, I found that...
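When a context-parallel run diverges from a single-device flash-attn2 run, a common first step is to quantify the mismatch with a max-absolute-difference and allclose-style tolerance check. Below is a minimal stdlib sketch of that check; the tensor values and tolerances are illustrative assumptions, not taken from the issue, and real tests would operate on torch tensors with torch.allclose:

```python
def max_abs_diff(a, b):
    """Element-wise maximum absolute difference between two flat lists."""
    return max(abs(x - y) for x, y in zip(a, b))

def allclose(a, b, rtol=1e-3, atol=1e-3):
    """Mirrors torch.allclose's tolerance rule: |a - b| <= atol + rtol * |b|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# Hypothetical flattened outputs from the CP run and the flash-attn2 reference.
out_cp = [0.1001, -0.2499, 0.5002]
out_ref = [0.1000, -0.2500, 0.5000]

print(max_abs_diff(out_cp, out_ref))  # small discrepancy on the order of fp16 rounding
print(allclose(out_cp, out_ref))
```

If allclose fails at fp16-scale tolerances, the argmax of the element-wise difference usually points at which sequence chunk (and hence which CP rank) introduced the divergence.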

There are precision errors compared with flash_attn_2_cuda.varlen_fwd when I use the flashinfer.single_prefill_with_kv_cache function to run the cohere_plus model. Below is the code I used: fi_fwd_out = flashinfer.single_prefill_with_kv_cache(q.contiguous(), k.contiguous(), v.contiguous(), causal=True, sm_scale=softmax_scale, allow_fp16_qk_reduction=False)...
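To localize where two attention kernels disagree, both can be compared against a slow but simple reference that computes softmax(q @ k^T * sm_scale) @ v directly, with a causal mask as in the call above. This is a pure-Python sketch under assumed list-of-row-vectors shapes, not flashinfer's or flash-attn's implementation:

```python
import math

def naive_prefill_attention(q, k, v, sm_scale, causal=True):
    """O(n^2) reference attention over lists of row vectors (seq_len x head_dim)."""
    n = len(q)
    out = []
    for i in range(n):
        limit = i + 1 if causal else n  # causal mask: query i attends to keys j <= i
        scores = [sm_scale * sum(qd * kd for qd, kd in zip(q[i], k[j]))
                  for j in range(limit)]
        m = max(scores)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        denom = sum(exps)
        probs = [e / denom for e in exps]
        out.append([sum(p * v[j][d] for j, p in enumerate(probs))
                    for d in range(len(v[0]))])
    return out

# Tiny hypothetical example: seq_len=3, head_dim=2, sm_scale = 1/sqrt(head_dim).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = naive_prefill_attention(q, k, v, sm_scale=1.0 / math.sqrt(2), causal=True)
print(out[0])  # with a causal mask, row 0 attends only to itself, so it returns v[0]
```

Running both kernels in fp16 but the reference in fp64 shows whether the error is ordinary reduced-precision accumulation (both kernels roughly equidistant from the reference) or a genuine bug in one of them; allow_fp16_qk_reduction=False in the call above already rules out one common source of extra fp16 error in the q @ k^T reduction.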