
[bug] Outdated TransformerEngine

Open dbpprt opened this issue 1 year ago • 1 comments

Checklist

  • [x] I've prepended issue tag with type of change: [bug]
  • [ ] (If applicable) I've attached the script to reproduce the bug
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: The included version of TransformerEngine (0.12.0) is not compatible with FlashAttention > 2.0.4, while recent transformers versions require FlashAttention > 2.0.4.

DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker

Current behavior: The bundled TransformerEngine is outdated and does not work with recent versions of FlashAttention.

Expected behavior: The image should be usable with recent versions of FlashAttention and transformers.

Additional context:
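A minimal sketch of the version conflict described above. The compatibility bounds (TE 0.12.0 supporting only flash-attn <= 2.0.4) are taken from this issue; the helper function and its name are illustrative, not part of either library:

```python
def parse_version(v: str) -> tuple:
    """Parse a 'major.minor.patch' string into a comparable tuple."""
    return tuple(int(part) for part in v.split(".")[:3])

def flash_attn_compatible(te_version: str, fa_version: str) -> bool:
    """Rough compatibility check per the constraint reported in this issue:
    TransformerEngine 0.12.0 (shipped in the DLC) only supports
    flash-attn <= 2.0.4; later TE releases relax that pin."""
    if parse_version(te_version) <= (0, 12, 0):
        return parse_version(fa_version) <= (2, 0, 4)
    return True

# The DLC's TE 0.12.0 cannot be used with the flash-attn that
# recent transformers releases require:
print(flash_attn_compatible("0.12.0", "2.3.0"))  # False
print(flash_attn_compatible("0.12.0", "2.0.4"))  # True
```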

dbpprt avatar Jun 07 '24 14:06 dbpprt

we are also working on a pip wheel for TEv1.11 (ETA 10/15) that will remove the version requirement for flash-attn and make it an optional dependency. That might be a good time to update the DLC.
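[Editorial note] In practice, "optional dependency" as described in the comment above would look something like the following guarded import, so code can fall back to a non-FlashAttention path when the package is absent; this is a generic sketch, not TransformerEngine's actual implementation:

```python
# Probe for flash-attn and record availability instead of failing at import
# time; callers can branch on HAS_FLASH_ATTN to pick an attention backend.
try:
    import flash_attn  # accelerated attention kernels, optional
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

print(f"flash-attn available: {HAS_FLASH_ATTN}")
```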

sbhavani avatar Sep 05 '24 14:09 sbhavani

https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/2.6/py3/cu126/Dockerfile.ec2.gpu.core_packages.json - TE is updated to version 2.0+

sbhavani avatar Apr 18 '25 19:04 sbhavani