maxtext
Added support for cuDNN flash attention
- Implemented cuDNN flash attention via Transformer Engine
- Currently it supports head_dim up to 128 and does not yet support GQA (a dispatch sketch follows this list). The underlying API is unstable; we will soon move to a stable one with broader feature support
- Added a `--multiprocess_gpu` flag to support multiprocess SLURM tasks (see the initialization sketch after this list)
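For context on the limitation above, here is a minimal sketch of the kind of gating it implies (the names `can_use_cudnn_flash` and `reference_attention` are hypothetical illustrations, not the actual MaxText or Transformer Engine code): take the fused path only when head_dim is at most 128 and the query and KV head counts match, and fall back to plain dot-product attention otherwise.

```python
import jax
import jax.numpy as jnp

# head_dim limit of the cuDNN flash attention path described in this PR
MAX_CUDNN_HEAD_DIM = 128

def reference_attention(q, k, v):
  """Plain dot-product attention used as the fallback path.

  q, k, v: [batch, seq, num_heads, head_dim]
  """
  scores = jnp.einsum('bqhd,bkhd->bhqk', q, k) / jnp.sqrt(q.shape[-1])
  weights = jax.nn.softmax(scores, axis=-1)
  return jnp.einsum('bhqk,bkhd->bqhd', weights, v)

def can_use_cudnn_flash(q, k):
  """Gate the fused path: head_dim <= 128 and no GQA (equal head counts)."""
  head_dim_ok = q.shape[-1] <= MAX_CUDNN_HEAD_DIM
  no_gqa = q.shape[-2] == k.shape[-2]  # num query heads == num KV heads
  return head_dim_ok and no_gqa
```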
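On the SLURM flag: a hedged sketch of what multiprocess GPU initialization typically looks like under SLURM (the exact wiring behind `--multiprocess_gpu` in MaxText may differ). Recent JAX versions can auto-detect the SLURM cluster environment when `jax.distributed.initialize()` is called with no arguments.

```python
import jax

# Under SLURM, jax.distributed.initialize() can pick up the coordinator
# address, process count, and this task's process id from SLURM
# environment variables (e.g. SLURM_NTASKS, SLURM_PROCID).
jax.distributed.initialize()

print(f"process {jax.process_index()}/{jax.process_count()}, "
      f"local GPUs: {jax.local_device_count()}")
```

Launched with something like `srun --ntasks-per-node=8 python train.py`, each task then drives its local GPUs while participating in the global device mesh.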
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
View this failed invocation of the CLA check for more information.
For the most up to date status, view the checks section at the bottom of the pull request.
Hi @rwitten, I have made the changes. Please let me know how it looks. Also, I have been added to the contributors list. My current commit email is verified, but my initial commit email is not. Is there any way to remove it? Just wondering what the easiest way to fix this would be. Thanks!