
[AnyPrecision optimizer] add automatic BF16 support check (network and gpu)

Open lessw2020 opened this issue 2 years ago • 0 comments

What does this PR do? Please describe: Adds an automatic BFloat16 support check to the AnyPrecision optimizer (self.verify_bfloat_support()).
The check runs at optimizer init if any of the relevant states use torch.bfloat16.
It verifies BFloat16 support on both the GPU and the network (NCCL) and, if either check fails, prints an error message and raises an exception.
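
For readers who want a feel for what such a check involves, here is a minimal sketch. It is not the code in this PR. The helper name `bf16_requirements_met` is hypothetical, and the thresholds (CUDA 11.0 or newer, NCCL 2.10 or newer) are assumptions taken from commonly cited PyTorch guidance rather than from this change. The decision logic is kept separate from the torch queries so it can be exercised without a GPU.

```python
from typing import Optional, Tuple


def bf16_requirements_met(
    cuda_version: Optional[str],
    gpu_supports_bf16: bool,
    nccl_available: bool,
    nccl_version: Tuple[int, ...],
) -> bool:
    """Hypothetical helper: decide whether BF16 is usable on GPU and over NCCL.

    The thresholds (CUDA >= 11.0, NCCL >= 2.10) are assumptions, not values
    confirmed by this PR.
    """
    if cuda_version is None:  # CPU-only PyTorch build
        return False
    cuda_major = int(cuda_version.split(".")[0])
    return bool(
        gpu_supports_bf16
        and cuda_major >= 11
        and nccl_available
        and tuple(nccl_version) >= (2, 10)
    )


def verify_bfloat_support() -> None:
    """Sketch of an init-time check: report the problem and raise on failure."""
    # Imported lazily so the pure helper above works without torch installed.
    import torch
    import torch.cuda.nccl as nccl
    import torch.distributed as dist

    cuda_version = torch.version.cuda  # None on CPU-only builds
    has_gpu = cuda_version is not None and torch.cuda.is_available()
    gpu_ok = has_gpu and torch.cuda.is_bf16_supported()
    nccl_ok = has_gpu and dist.is_nccl_available()
    # nccl.version() returns a (major, minor, patch) tuple in recent releases.
    nccl_version = nccl.version() if nccl_ok else (0, 0)

    if not bf16_requirements_met(cuda_version, gpu_ok, nccl_ok, nccl_version):
        message = "torch.bfloat16 requested, but GPU and/or NCCL lack BFloat16 support."
        print(message)
        raise RuntimeError(message)
```

In an optimizer, a call like `verify_bfloat_support()` would sit in `__init__`, guarded by a test that at least one configured state dtype is `torch.bfloat16`, so that pure FP32 runs never pay for or fail on the check.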

Fixes #{issue number}

Does your PR introduce any breaking changes? If yes, please list them: List of all backwards-incompatible API changes.

Check list:

  • [ ] Was this discussed and approved via a GitHub issue? (not for typos or docs)
  • [ ] Did you read the contributor guideline?
  • [ ] Did you make sure that your PR does only one thing instead of bundling different changes together?
  • [ ] Did you make sure to update the documentation with your changes? (if necessary)
  • [ ] Did you write any new necessary tests?
  • [ ] Did you verify new and existing tests pass locally with your changes?
  • [ ] Did you update the CHANGELOG? (not for typos, docs, or minor internal changes)

lessw2020, Sep 13 '22 01:09