[bug] Outdated TransformerEngine
Checklist
- [x] I've prepended issue tag with type of change: [bug]
- [ ] (If applicable) I've attached the script to reproduce the bug
- [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
- [ ] (If applicable) I've documented below the tests I've run on the DLC image
- [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
- [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)
Concise Description: The included version of TransformerEngine (0.12.0) is not compatible with FlashAttention > 2.0.4, while recent versions of transformers require FlashAttention > 2.0.4.
DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
Current behavior: The image ships an old TransformerEngine (0.12.0) that doesn't support recent versions of FlashAttention.
Expected behavior: The image should be usable with recent versions of FlashAttention and transformers.
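A quick way to confirm the mismatch from inside the container (a minimal sketch; the distribution names and the import-time failure mode are assumptions based on this report):

```python
# Version check inside the DLC image (sketch; assumes the packages are
# registered under the distribution names "transformer-engine" and "flash-attn").
from importlib.metadata import version

print(version("transformer-engine"))  # 0.12.0 in this image
print(version("flash-attn"))

# After upgrading flash-attn past 2.0.4 (e.g. to satisfy a recent transformers
# release), importing TE's PyTorch bindings is expected to fail, since TE 0.12.0
# caps the flash-attn versions it accepts.
import transformer_engine.pytorch  # noqa: F401
```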
Additional context:
We are also working on a pip wheel for TE v1.11 (ETA 10/15) that will remove the version requirement on flash-attn and make it an optional dependency. That might be a good time to update the DLC.
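Until the DLC picks up a newer TE, a fail-fast guard at job start gives a clearer error than a deep import failure (a sketch; the 2.0.4 cap comes from this report rather than a constant exported by TE, and `packaging` is assumed to be available in the image):

```python
# Fail fast, with an actionable message, when the installed flash-attn is newer
# than what TransformerEngine 0.12.0 supports (cap taken from this report).
from importlib.metadata import version
from packaging.version import Version

MAX_FA_FOR_TE_0_12 = Version("2.0.4")  # assumption from this report

installed_fa = Version(version("flash-attn"))
if installed_fa > MAX_FA_FOR_TE_0_12:
    raise RuntimeError(
        f"flash-attn {installed_fa} is newer than TransformerEngine 0.12.0 "
        f"supports (<= {MAX_FA_FOR_TE_0_12}); pin flash-attn or rebuild the "
        "image with a newer TransformerEngine."
    )
```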
In the PyTorch 2.6 training Dockerfile's core packages list (https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/2.6/py3/cu126/Dockerfile.ec2.gpu.core_packages.json), TE has been updated to version 2.0+.