Tri Dao
If your application is very sensitive to numerical error, then flash-attn might not be a good fit, mainly because we only support fp16 / bf16 and not fp32.
Then flash-attn should be more accurate than the standard implementation. You want to compare (flash-attn in bf16 - reference impl in fp32) vs (reference impl in bf16 - reference impl in fp32).
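For concreteness, a minimal sketch of that comparison (the shapes and the reference implementation here are illustrative, not the exact test from this issue):

```python
import torch
from flash_attn import flash_attn_func

def ref_attn(q, k, v):
    # Plain (non-causal) attention, computed in the dtype of its inputs.
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bthd,bshd->bhts", q, k) * scale
    return torch.einsum("bhts,bshd->bthd", torch.softmax(scores, dim=-1), v)

batch, seqlen, nheads, headdim = 2, 512, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float32)
k, v = torch.randn_like(q), torch.randn_like(q)

out_ref_fp32 = ref_attn(q, k, v)
out_ref_bf16 = ref_attn(q.bfloat16(), k.bfloat16(), v.bfloat16()).float()
out_flash_bf16 = flash_attn_func(q.bfloat16(), k.bfloat16(), v.bfloat16()).float()

# flash-attn's bf16 error should be comparable to (or smaller than) the
# reference implementation's own bf16 error.
print("flash bf16 vs ref fp32:", (out_flash_bf16 - out_ref_fp32).abs().max().item())
print("ref bf16   vs ref fp32:", (out_ref_bf16 - out_ref_fp32).abs().max().item())
```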
The error seems too high; you can try `flash_attn_func` since it's simpler to call (no need to construct cu_seqlens, which might be error-prone). Try to make the test as simple as possible.
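A sketch of the difference between the two call paths (assuming equal-length sequences; shapes are illustrative):

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

batch, seqlen, nheads, headdim = 2, 512, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Simple interface: padded (batch, seqlen, nheads, headdim) tensors, no cu_seqlens.
out = flash_attn_func(q, k, v, causal=True)

# Varlen interface: tokens flattened to (total, nheads, headdim) plus cu_seqlens,
# which is the part that is easy to get wrong.
cu_seqlens = torch.arange(0, (batch + 1) * seqlen, seqlen, dtype=torch.int32, device="cuda")
q_u, k_u, v_u = (t.reshape(batch * seqlen, nheads, headdim) for t in (q, k, v))
out_varlen = flash_attn_varlen_func(
    q_u, k_u, v_u, cu_seqlens, cu_seqlens, seqlen, seqlen, causal=True
).reshape(batch, seqlen, nheads, headdim)
```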
I haven't tried with PyTorch 2.2.2, but I don't see why compiling from source wouldn't work. The wheel may or may not be compatible.
Looks like it's still downloading the wheel? Can you try `python3 setup.py install`?
We have new wheels (flash-attn 2.5.8) that should work with torch 2.2.2
Most of the reads/writes are coalesced. There are some small writes (e.g. writing the LSE, the softmax log-sum-exp) that are not, but I don't think it matters. Let me know if you profile more and find otherwise.
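If you want to dig further, a minimal profiling sketch (the shapes are illustrative; `torch.profiler` only gives per-kernel timings, so kernel-level memory-access metrics would need something like Nsight Compute):

```python
import torch
from torch.profiler import profile, ProfilerActivity
from flash_attn import flash_attn_func

q = torch.randn(4, 2048, 16, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        flash_attn_func(q, k, v, causal=True)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```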
`pip install flash-attn==1.0.9`
Sure, you can try that.
I don't quite understand your algorithm; can you add some pseudocode?