[C/PyTorch] Refactor and move userbuffers into TE/common

Open • denera opened this issue 1 year ago • 1 comment

This PR moves all the userbuffers code in TE/pytorch to TE/common and refactors the interfaces to make TE/common/userbuffers accessible to all framework integrations.

To do:

  • [x] Move userbuffers from TE/pytorch to TE/common.
  • [x] Bootstrap userbuffers with PyTorch collectives.
  • [x] Update build logic with CXX ABI version fix and correct rpaths.
  • [x] Implement comm overlap example for PyTorch.
  • [x] Verify split_overlap_ag_p2p
  • [x] Verify split_overlap_rs_p2p
  • [ ] Verify split_overlap_rs
  • [ ] Verify atomic_gemm_overlap_ag_p2p
  • [ ] Verify atomic_gemm_overlap_rs_p2p
  • [ ] Verify atomic_gemm_overlap_rs
  • [ ] Verify bulk_overlap for AG
  • [ ] Verify bulk_overlap for RS
  • [ ] Implement unit tests.
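
For readers unfamiliar with the overlap variants in the checklist, the following is a minimal single-process sketch of the idea commonly associated with names like split_overlap_ag_p2p: the all-gather is split into per-rank chunks passed around a ring with point-to-point transfers, and the GEMM for each chunk runs as soon as that chunk arrives. This is an assumption about what the name denotes, not the userbuffers implementation. The function names `matmul` and `ring_allgather_gemm` are hypothetical, ranks are simulated with Python lists, and the compute and communication phases run sequentially here rather than concurrently.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ring_allgather_gemm(local_chunks, weight):
    """Simulate a ring all-gather interleaved with a chunked GEMM.

    local_chunks[r] is the row block of the input owned by simulated rank r.
    Returns, for every rank, (gathered input) @ weight.
    Hypothetical helper for illustration only.
    """
    world = len(local_chunks)
    # outputs[r][src] holds the GEMM result for the chunk that originated on rank src.
    outputs = [[None] * world for _ in range(world)]
    # in_flight[r] is the (source rank, chunk) pair that rank r currently holds.
    in_flight = [(r, local_chunks[r]) for r in range(world)]

    for step in range(world):
        # Compute phase: on real hardware this GEMM would overlap with the
        # next transfer; in this simulation the phases simply alternate.
        for r in range(world):
            src, chunk = in_flight[r]
            outputs[r][src] = matmul(chunk, weight)
        # Communication phase: each rank passes its current chunk to rank r + 1.
        # The transfer after the final GEMM is not needed, so it is skipped.
        if step < world - 1:
            in_flight = [in_flight[(r - 1) % world] for r in range(world)]

    # Concatenate the per-source row blocks in rank order.
    return [[row for block in outputs[r] for row in block] for r in range(world)]


chunks = [[[1, 2]], [[3, 4]], [[5, 6]]]  # one input row per simulated rank
identity = [[1, 0], [0, 1]]
results = ring_allgather_gemm(chunks, identity)
```

With an identity weight, every simulated rank ends up with the full gathered input, which makes the sketch easy to check against an ordinary all-gather followed by a single GEMM.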

denera • Apr 08 '24 20:04

@timmoon10 FYI, I will be removing the third-party dlpack package I introduced earlier in this PR. It's not needed for the PyTorch collective callbacks, and I can bring it back if it becomes necessary for JAX down the line (but I'd like to avoid that if I can).

denera • May 18 '24 01:05

This work has been moved to a new branch due to too many conflicts with TE/main. Closing the PR and filing a new one.

denera • Jul 31 '24 18:07