maxtext
A simple, performant, and scalable JAX LLM!
Adds a ragged attention kernel in Pallas, along with a unit test for the new code. Note that the ragged attention kernel is not yet actively used in this codebase...
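For intuition, here is a minimal plain-JAX sketch of what ragged attention computes during decode: each batch element attends only over its own valid key/value prefix, whose length is supplied per sequence. The function name, shapes, and the tiny smoke test are illustrative assumptions, not the actual Pallas kernel or MaxText's unit test.

```python
# Hypothetical plain-JAX reference for ragged (length-aware) decode attention.
# Names and shapes are illustrative; the real MaxText kernel is written in Pallas.
import jax
import jax.numpy as jnp


def ragged_decode_attention(q, k, v, lengths):
    """q: [batch, heads, head_dim] single query token per sequence.
    k, v: [batch, max_len, heads, head_dim] padded KV cache.
    lengths: [batch] number of valid KV entries per sequence."""
    scores = jnp.einsum("bhd,bshd->bhs", q, k) / jnp.sqrt(q.shape[-1])
    # Mask out KV positions beyond each sequence's true length.
    positions = jnp.arange(k.shape[1])[None, None, :]   # [1, 1, max_len]
    mask = positions < lengths[:, None, None]           # [batch, 1, max_len]
    scores = jnp.where(mask, scores, -1e30)
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("bhs,bshd->bhd", weights, v)


# Tiny smoke test, analogous in spirit to the unit test mentioned above.
key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (2, 4, 64))
k = jax.random.normal(key, (2, 16, 4, 64))
v = jax.random.normal(key, (2, 16, 4, 64))
out = ragged_decode_attention(q, k, v, jnp.array([5, 16]))
print(out.shape)  # (2, 4, 64)
```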
- Implemented cuDNN flash attention with Transformer Engine. Currently it supports head_dim up to 128 and does not yet support GQA. It is an unstable API and is likely to change soon...
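A hedged sketch of a guard that captures the constraints described above (head_dim at most 128, no GQA, i.e. the KV head count must equal the query head count). The function and argument names are hypothetical, not MaxText's or Transformer Engine's actual API.

```python
# Hypothetical guard reflecting the documented Transformer Engine constraints;
# the function and argument names are illustrative, not the real MaxText API.
def can_use_cudnn_flash_attention(head_dim: int,
                                  num_query_heads: int,
                                  num_kv_heads: int) -> bool:
    # head_dim is supported only up to 128, and GQA (fewer KV heads) is not yet supported.
    return head_dim <= 128 and num_query_heads == num_kv_heads


assert can_use_cudnn_flash_attention(head_dim=128, num_query_heads=16, num_kv_heads=16)
assert not can_use_cudnn_flash_attention(head_dim=256, num_query_heads=16, num_kv_heads=16)
assert not can_use_cudnn_flash_attention(head_dim=64, num_query_heads=16, num_kv_heads=4)
```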
python MaxText/decode.py MaxText/configs/base.yml per_device_batch_size=64 run_name=runner_2024-01-30-20-02 max_prefill_predict_length=128 max_target_length=256 dataset_path=gs://maxtext-dataset async_checkpointing=false scan_layers=false attention=dot_product ici_autoregressive_parallelism=4
This achieves 400 GB/s/device on a v4-8.
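For context on the bandwidth figure, here is a rough, hedged back-of-the-envelope calculation of per-device memory bandwidth during memory-bound autoregressive decode. Every number below (parameter count, precision, device count, step time) is an illustrative assumption chosen to land near the quoted figure, not a measurement from this run.

```python
# Rough back-of-the-envelope for the per-device bandwidth figure quoted above.
# All values are illustrative assumptions, not measurements from this run.
params = 1.0e9            # assumed parameter count
bytes_per_param = 2       # assumed bf16 weights
num_devices = 4           # assumed JAX device count for a v4-8
step_time_s = 1.25e-3     # assumed time per autoregressive decode step, in seconds

# During memory-bound decode, each device reads roughly its shard of the weights once per step.
bytes_per_device_per_step = params * bytes_per_param / num_devices
bandwidth_gb_s = bytes_per_device_per_step / step_time_s / 1e9
print(f"~{bandwidth_gb_s:.0f} GB/s/device")  # ~400 GB/s/device with these toy numbers
```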
Example of changing where profiling starts; in this case, profiling begins at data loading.
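A minimal sketch, assuming the standard jax.profiler trace API, of starting the profiler before data loading so the data pipeline shows up in the trace. The toy iterator and log directory are placeholders, not MaxText internals.

```python
# Sketch: start the JAX profiler early enough to capture data loading in the trace.
# The iterator and log directory below are stand-ins, not MaxText's data pipeline.
import jax
import jax.numpy as jnp


def toy_data_iterator():
    while True:
        yield jnp.ones((8, 128))            # placeholder batch

data_iterator = toy_data_iterator()

jax.profiler.start_trace("/tmp/maxtext_profile")   # start before data loading
batch = next(data_iterator)                        # data loading now appears in the trace
result = jnp.sum(batch)                            # stand-in for a train/decode step
result.block_until_ready()
jax.profiler.stop_trace()
```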
Quick prototype to compute Goodput from total step time and total job time in MaxText.
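A minimal sketch of that calculation, assuming Goodput is defined as the fraction of total job time spent in productive training steps; the function name and example numbers are illustrative.

```python
# Hedged sketch of the Goodput calculation described above. Numbers are illustrative.
def compute_goodput(total_step_time_s: float, total_job_time_s: float) -> float:
    """Goodput = time spent in productive steps / total wall-clock job time."""
    return total_step_time_s / total_job_time_s


# Example: 9,000 s of step time inside a 10,000 s job -> 90% Goodput.
print(f"Goodput: {compute_goodput(9_000, 10_000):.0%}")
```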
Inference decode configurations intended for CPU, covering model sizes of 1B, 4B, 8B, and 16B parameters.