Add FlashAttention Kernel in Triton
TL;DR: Add an implementation of FlashAttention written in OpenAI's Triton language.
Background:
- FlashAttention: an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU HBM and on-chip SRAM. It reports a 15% end-to-end speedup on BERT-large over the MLPerf 1.1 training record, a 3× speedup on GPT-2, and a 2.4× speedup on Long-Range Arena. A minimal sketch of the tiling idea follows this list.
- Triton: a Python-like programming language for writing highly efficient GPU code.
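To make the tiling idea concrete, here is a minimal PyTorch sketch of blocked attention with an online softmax, which is the math FlashAttention fuses on-chip. This is an illustration only, not the Triton kernel in this PR; the name `tiled_attention`, the `block_size` value, and the fp32 cast are assumptions made for readability.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """Illustrative sketch of FlashAttention's tiling + online softmax.

    q, k, v: [seq_len, head_dim]. The real kernel fuses these loops on the
    GPU so the full attention matrix never hits HBM; this PyTorch loop only
    shows that the blocked math reproduces exact softmax(QK^T)V.
    """
    q, k, v = q.float(), k.float(), v.float()  # keep the sketch in fp32
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"), device=q.device)
    row_sum = torch.zeros(seq_len, 1, device=q.device)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]           # load one K tile
        v_blk = v[start:start + block_size]           # load one V tile
        scores = (q @ k_blk.T) * scale                # [seq_len, block_size]

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        correction = torch.exp(row_max - new_max)     # rescale old partial sums
        p = torch.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

# sanity check against naive attention
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)
```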
Major Changes:
- Add a FlashAttention forward pass to ParlAI
- Replace encoder self-attention with FlashAttention
- Add a unit test for self-attention functionality to ensure correctness within an acceptable epsilon (0.01); a sketch of this check follows this list
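The correctness check described above could look roughly like the following. `flash_attention_forward` is a hypothetical stand-in for the Triton-backed forward pass in this PR (the actual ParlAI module names aren't shown here), and the tensor shapes are illustrative.

```python
import torch

def test_self_attention_matches_reference(flash_attention_forward):
    """Sketch of the correctness check: Triton output vs. naive attention.

    `flash_attention_forward(q, k, v)` is a hypothetical stand-in for the
    Triton-backed forward pass added in this PR, not an actual ParlAI API.
    """
    torch.manual_seed(0)
    bsz, heads, seq_len, head_dim = 2, 8, 512, 64
    shape = (bsz, heads, seq_len, head_dim)
    q = torch.randn(shape, device="cuda", dtype=torch.float16)
    k = torch.randn(shape, device="cuda", dtype=torch.float16)
    v = torch.randn(shape, device="cuda", dtype=torch.float16)

    scale = head_dim ** -0.5
    ref = torch.softmax((q @ k.transpose(-2, -1)).float() * scale, dim=-1) @ v.float()
    out = flash_attention_forward(q, k, v)

    # "acceptable epsilon" from the change list above
    assert torch.allclose(out.float(), ref, atol=1e-2)
```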
Evaluations:
- unit tests; runtimes by context length are tabulated below, and a timing sketch follows this section
| N_CTX | Triton | ParlAI |
|---|---|---|
| 512.0 | 0.333863 | 0.696174 |
| 1024.0 | 1.005227 | 2.410752 |
| 2048.0 | 3.513344 | 9.326592 |
| 4096.0 | 13.148160 | 37.028866 |
- ConvAI2 results
  - runtime
  - qualitative results
  - quantitative results
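The harness that produced the runtimes in the table above isn't shown in this PR; the sketch below is one generic way such a sweep could be reproduced with CUDA events. `attention_fn`, the batch/head/head-dim sizes, and the warmup/iteration counts are assumptions.

```python
import torch

def bench_ms(fn, warmup=10, iters=50):
    """Average milliseconds per call for a CUDA function, via CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Sweep the same context lengths as the table above.
# `attention_fn` is a placeholder for either implementation under test.
for n_ctx in (512, 1024, 2048, 4096):
    q = torch.randn(1, 16, n_ctx, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    # print(n_ctx, bench_ms(lambda: attention_fn(q, k, v)))
```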
Testing Steps:
Wow, this heroic change set!
Is this ready for review, or still a draft?
@klshuster I don't think the code will ever be mergeable given Triton's experimental nature. I talked with some people who worked on FlashAttention, and it seems Triton's implementation only works with specific head dims. I asked @pearlli98 to open a PR in order to save her WIP, and we could review the code if we want to.
perhaps we can merge some of the results to an internal project directory?
@klshuster I can push the results to parlai-internal. The problem is that we don't have the transformer module files, which I made changes to, in the internal repo. Do you want me to split the changes across the two places (code files here and results on internal)?
yes that would be great
@klshuster I have removed the results folder and moved it to ParlAI-internal under this PR.
Hi @pearlli98!
Thank you for your pull request.
We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!