Less Wright

44 issues

Update main: test the Rstd tensor name vs rstd to see which one registering fusedRMS as an op is actually concerned with.

CLA Signed

This PR builds on top of the prior PR (1208) for this blog post and adds: 1) corrected author name, 2) missing math equations from the appendix.

**What does this PR do? Please describe:** Adds an automatic check for BFloat16 support to AnyPrecision optimizer (self.verify_bfloat_support()). This happens at optimizer init if any of the relevant states are...
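The init-time check described above can be sketched in plain Python. This is a minimal sketch of the idea only: the function name, the `state_dtypes` mapping, and the `device_supports_bf16` flag are stand-ins (in the actual optimizer the flag would presumably come from something like `torch.cuda.is_bf16_supported()`), not the PR's real signature:

```python
def verify_bfloat_support(state_dtypes, device_supports_bf16):
    """Fail fast at optimizer init if any optimizer state is configured
    to use bfloat16 on a device that cannot run it.

    state_dtypes: mapping of state name -> dtype string, e.g. "bfloat16"
    device_supports_bf16: bool; in real code, queried from the device
    """
    # collect the states that request bfloat16
    bf16_states = [name for name, dt in state_dtypes.items() if dt == "bfloat16"]
    if bf16_states and not device_supports_bf16:
        raise ValueError(
            f"bfloat16 requested for {bf16_states}, but the current device "
            "does not support bfloat16"
        )
```

Running the check once at init means the user gets an immediate, descriptive error instead of a cryptic kernel failure on the first optimizer step.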

CLA Signed

Enhancement (credit to @rohan-varma): "this can be done in a follow up PR, but let's maybe consider not defaulting things to torch.bfloat16 eventually. this is because it might be good...

enhancement

Problem - if the user runs the AnyPrecision optimizer with Kahan summation and checkpoints the model/optimizer, restarting training may start with an empty compensation buffer. This is not a blocking problem, but...
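One way to handle the restart case is to re-create any missing compensation buffer as zeros when the checkpoint is loaded. This is a hedged, torch-free sketch of that idea, not the optimizer's actual restore path; `restore_kahan_state`, `param_numels`, and the `"compensation"` key are illustrative names:

```python
def restore_kahan_state(param_numels, loaded_state):
    """Ensure every parameter has a Kahan compensation buffer after a
    checkpoint restore; recreate a zeroed buffer where one is missing.

    param_numels: mapping param name -> number of elements
    loaded_state: mapping param name -> state dict that may lack "compensation"
    """
    for name, numel in param_numels.items():
        state = loaded_state.setdefault(name, {})
        comp = state.get("compensation")
        if comp is None or len(comp) != numel:
            # zeros are the natural Kahan starting point: the first steps
            # after restart lose a little accumulated precision, but
            # training proceeds correctly
            state["compensation"] = [0.0] * numel
    return loaded_state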

enhancement

Using the latest nightly (1109) and running on an H100 server: running tests/local_test_c10d.py results in the final tensor comparison failing with a 16% mismatch (appears to be rounding; the largest diff is 0.0097)...

PRs for tau are failing due to an unrelated missing rule for DTensor: "Operator aten.fill.Scalar does not have a DistributedTensor rule registered." Details: Traceback (most recent call last): File "/__w/tau/tau/test/spmd/tensor/test_dtensor_ops.py",...

Currently we run fusion based on an integer policy, where the integer maps to the total number of comm calls to fuse. Need to add a bucket-size policy handler to setup...
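A bucket-size policy replaces "fuse every N calls" with "fuse until a byte budget is hit", similar in spirit to DDP's `bucket_cap_mb`. A minimal greedy sketch, assuming the comm calls are visited in order and `comm_sizes` holds each call's payload in bytes (both names are hypothetical):

```python
def bucket_by_size(comm_sizes, bucket_cap_bytes):
    """Greedy bucketing: accumulate consecutive comm calls until adding the
    next one would exceed bucket_cap_bytes, then start a new bucket.
    Returns a list of buckets, each a list of comm-call indices."""
    buckets, current, current_bytes = [], [], 0
    for idx, nbytes in enumerate(comm_sizes):
        if current and current_bytes + nbytes > bucket_cap_bytes:
            buckets.append(current)          # flush the full bucket
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += nbytes
    if current:
        buckets.append(current)              # flush the trailing bucket
    return buckets
```

A size-based policy adapts to the model automatically, whereas an integer count policy produces wildly different buffer sizes depending on tensor shapes.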

Currently we assume all comm calls can be fused with any other comm call (i.e. all use the default process group). This is usually correct, but we need to implement a check of...
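The guard amounts to partitioning the comm calls into runs that share a process group and only fusing within a run. A small sketch, assuming each call is represented as a dict with a (hypothetical) `"process_group"` key:

```python
from itertools import groupby

def fusable_runs(comm_calls):
    """Split an ordered list of comm calls into maximal consecutive runs
    that share the same process group; only calls within one run may be
    fused into a single buffer."""
    return [
        list(run)
        for _, run in groupby(comm_calls, key=lambda call: call["process_group"])
    ]
```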

Currently we default to FP32 for the fusion buffer, but that is not correct for mixed-precision cases. Thus, we need to check the shape-prop metadata and build the buffer with the correct dtype.
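The dtype selection could look like the following sketch. It is deliberately torch-free (dtypes as strings, a hand-rolled width table) purely to illustrate the policy; real code would read dtypes from the FX shape-prop metadata and could use `torch.promote_types` instead of the widest-wins rule assumed here:

```python
# bit widths for the dtypes we expect in shape-prop metadata (assumption)
_WIDTH = {"float16": 16, "bfloat16": 16, "float32": 32, "float64": 64}

def fusion_buffer_dtype(node_dtypes):
    """Choose the fusion-buffer dtype from the dtypes recorded in
    shape-prop metadata instead of hardcoding FP32: if every fused tensor
    shares one dtype, use it; otherwise fall back to the widest dtype so
    no operand loses precision in the shared buffer."""
    unique = set(node_dtypes)
    if len(unique) == 1:
        return unique.pop()
    return max(unique, key=lambda dt: _WIDTH[dt])
```

With this policy a pure-bf16 graph gets a bf16 buffer (halving comm volume versus the FP32 default), while genuinely mixed graphs stay safe.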