Tri Dao
https://github.com/Dao-AILab/flash-attention/blob/b517a592049ed81a4cf9ad3aa4b4a7372e9d9a56/flash_attn/cute/flash_fwd_sm100.py
> Thanks! Sorry this is a stupid question.
>
> But to use it on B200s, what would I have to do? I followed this:
>
> ```
> cd...
> ```
I'm hearing aarch64 wheels will be coming soon (on the order of weeks).
Please look at existing issues on numerical error. The right thing to compare is (flashattn in fp16 - reference attn in fp32) vs (reference attn in fp16 - reference attn in fp32); a sketch of that comparison is below.
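A minimal sketch of that comparison, assuming `flash_attn_func` from the flash-attn package and PyTorch's `scaled_dot_product_attention` as the reference; the tensor shapes are illustrative, not prescriptive:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

# Illustrative shapes: (batch, seqlen, nheads, headdim), as flash_attn_func expects.
q, k, v = (torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

def ref_attn(q, k, v, dtype):
    # SDPA expects (batch, nheads, seqlen, headdim), so transpose in and out.
    qt, kt, vt = (t.transpose(1, 2).to(dtype) for t in (q, k, v))
    return F.scaled_dot_product_attention(qt, kt, vt).transpose(1, 2)

out_ref_fp32 = ref_attn(q, k, v, torch.float32)          # "ground truth"
out_ref_fp16 = ref_attn(q, k, v, torch.float16).float()  # baseline fp16 error
out_flash = flash_attn_func(q, k, v).float()

# FlashAttention is numerically fine if its error against the fp32 reference
# is comparable to the error a plain fp16 implementation already incurs.
err_flash = (out_flash - out_ref_fp32).abs().max()
err_fp16 = (out_ref_fp16 - out_ref_fp32).abs().max()
print(f"flash vs fp32 ref: {err_flash:.3e}, fp16 ref vs fp32 ref: {err_fp16:.3e}")
```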
https://pytorch.org/tutorials/recipes/recipes/benchmark.html
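For instance, a minimal sketch in the style of that recipe using `torch.utils.benchmark`, which handles CUDA synchronization and warmup for you (the matmul workload here is just a placeholder):

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

t = benchmark.Timer(
    stmt="torch.matmul(x, x)",
    globals={"x": x, "torch": torch},
)
# blocked_autorange picks the number of iterations automatically.
print(t.blocked_autorange(min_run_time=1.0))
```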
It's a beta release for now, we're doing more extensive testing before including it in the wheels.
Not yet. PRs are welcome.
[Triton tutorials](https://triton-lang.org/main/getting-started/tutorials/index.html) are a good place to start to learn about how tensors are laid out in memory, and how to read & write to them. After that you can...
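As a starting point, a minimal sketch in the spirit of the first Triton tutorial: each program instance computes its block of offsets, loads a masked chunk from global memory, and stores the result.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(98432, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```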
Can you say what steps are required to reproduce this?
Probably. You can search GitHub issues to see.