TransformerEngine
[PyTorch] Re-enable bias+GELU fusion for non-reentrant checkpointing -- WIP
TorchDynamo has known limitations with autograd.Function implementations and autograd.graph hooks. Activation recompute relies on both of those mechanisms, so this PR disables TorchDynamo on te.distributed.checkpoint() via the @no_torch_dynamo() decorator.
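For illustration, a minimal sketch of the decorator pattern described above, assuming a `no_torch_dynamo()` helper that wraps a function with `torch._dynamo.disable` when Dynamo is available (the names and checkpoint wrapper here are illustrative, not TE's exact internals):

```python
import torch


def no_torch_dynamo():
    """Return a decorator that excludes a function from TorchDynamo tracing."""
    def decorator(fn):
        if hasattr(torch, "_dynamo"):
            # Mark `fn` so Dynamo falls back to eager execution inside it.
            return torch._dynamo.disable(fn)
        # Older PyTorch builds without Dynamo: nothing to disable.
        return fn
    return decorator


@no_torch_dynamo()
def checkpoint(function, *args, **kwargs):
    # Stand-in for te.distributed.checkpoint(): the recompute logic
    # (autograd.Function + autograd.graph hooks) now runs outside of
    # Dynamo tracing, avoiding the limitations noted above.
    return torch.utils.checkpoint.checkpoint(
        function, *args, use_reentrant=False, **kwargs
    )
```

The decorator only skips tracing for the checkpoint entry point itself; the rest of the model can still be compiled as usual.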
@ksivaman Did we implement/merge lazy init for TE/PyTorch yet? If so, I can rebase, test, and merge this to re-enable the fusion with checkpointing.