TransformerEngine

Split wgrad&dgrad from backward() to support a2a overlap

Open lhb8125 opened this issue 10 months ago • 2 comments

Description

Add a flag split_bw that controls whether wgrad is separated from backward() and scheduled in a separate function, so that the all-to-all (a2a) communication can be hidden better when training MoE models. This MR supports 1F1B with a2a overlap in MCore, similar to the idea in DualPipe. The feature asserts that:

ub_bulk_wgrad == False

because that knob binds the output of wgrad to dgrad, which complicates the compute context of wgrad.
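The intended effect on scheduling can be sketched roughly as follows. This is a toy illustration, not code from this MR: `ToyLinearConfig`, `backward_schedule`, and the step labels are hypothetical, and only the flag names `split_bw` and `ub_bulk_wgrad` come from the description above. It also assumes that the a2a only depends on dgrad, so wgrad is free to run while the a2a is in flight.

```python
from dataclasses import dataclass


@dataclass
class ToyLinearConfig:
    # Hypothetical config holder; only the two flag names come from the PR text.
    split_bw: bool = False
    ub_bulk_wgrad: bool = False

    def __post_init__(self):
        # Mirrors the constraint stated in the description:
        # split_bw requires ub_bulk_wgrad == False.
        if self.split_bw and self.ub_bulk_wgrad:
            raise AssertionError("split_bw requires ub_bulk_wgrad == False")


def backward_schedule(cfg):
    """Return the (schematic) order of steps for one layer's backward pass."""
    if not cfg.split_bw:
        # wgrad runs inside backward(), so the a2a that follows is exposed.
        return ["dgrad", "wgrad", "a2a"]
    # wgrad is deferred and can run while the a2a is in flight.
    return ["dgrad", "a2a_launch", "wgrad", "a2a_wait"]


if __name__ == "__main__":
    print(backward_schedule(ToyLinearConfig()))
    print(backward_schedule(ToyLinearConfig(split_bw=True)))
```

In the second schedule the deferred wgrad sits between launching and waiting on the a2a, which is where the overlap would come from under these assumptions.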

Fixes # (issue)

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or a new content)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Infra/Build change
  • [ ] Code refactoring

Changes

Please list the changes introduced in this PR:

  • Add class WeightGradStore to put and pop the context of the wgrad computation;
  • Wrap and store the wgrad computation of the Linear/LayerNormLinear/GroupedLinear classes and pop it in wgrad_comp();
  • Add some unit tests in test_numerics.py.
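A minimal, framework-free sketch of the put/pop idea behind the changes listed above. The names `WeightGradStore`, `split_bw`, and `wgrad_comp` are taken from this description, but the interfaces shown here are assumptions for illustration and are not the implementation in this MR; `ToyLinear`, `matmul`, and `transpose` are hypothetical helpers that use plain Python lists instead of tensors.

```python
from collections import deque


class WeightGradStore:
    """Toy FIFO of deferred wgrad computations (assumed interface)."""

    def __init__(self):
        self._queue = deque()

    def put(self, fn, *args):
        # Store the function and the context it needs to run later.
        self._queue.append((fn, args))

    def pop(self):
        # Run the oldest deferred wgrad computation.
        fn, args = self._queue.popleft()
        return fn(*args)

    def __len__(self):
        return len(self._queue)


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class ToyLinear:
    """y = x @ W^T with W of shape (out, in); wgrad is optionally deferred."""

    def __init__(self, weight, split_bw=False):
        self.weight = weight
        self.split_bw = split_bw
        self.store = WeightGradStore()
        self.weight_grad = None
        self._x = None

    def forward(self, x):
        self._x = x  # saved for the wgrad computation
        return matmul(x, transpose(self.weight))

    def _wgrad(self, x, dy):
        # wgrad = dy^T @ x, same shape as the weight
        self.weight_grad = matmul(transpose(dy), x)

    def backward(self, dy):
        if self.split_bw:
            self.store.put(self._wgrad, self._x, dy)  # defer wgrad
        else:
            self._wgrad(self._x, dy)
        return matmul(dy, self.weight)  # dgrad = dy @ W

    def wgrad_comp(self):
        self.store.pop()


if __name__ == "__main__":
    layer = ToyLinear([[1, 2], [3, 4]], split_bw=True)
    layer.forward([[1, 0], [0, 1]])
    dx = layer.backward([[1, 1], [2, 0]])
    print(dx, layer.weight_grad)  # dgrad is ready, wgrad is still pending
    layer.wgrad_comp()
    print(layer.weight_grad)
```

With `split_bw=True`, `backward()` returns dgrad immediately and the weight gradient is only materialized when `wgrad_comp()` is called, so the caller decides where in the schedule that work lands.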

Checklist:

  • [ ] I have read and followed the contributing guidelines
  • [ ] The functionality is complete
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] New and existing unit tests pass locally with my changes

lhb8125 Mar 12 '25 10:03

@lhb8125 could you please sign off your commits (guide)?

ksivaman Apr 07 '25 13:04

> @lhb8125 could you please sign off your commits (guide)?

@ksivaman Rewriting the commit history is too tricky. Could you look at this MR instead, where all commits are signed off?

lhb8125 Apr 08 '25 02:04