Less Wright

44 issues

Update main: test the Rstd tensor name vs rstd to see which one registering fusedRMS as an op is actually concerned with.

CLA Signed

This PR builds on top of the prior PR (1208) for this blog post and adds: 1) corrected author name, 2) missing math equations from the appendix.

**What does this PR do? Please describe:** Adds an automatic check for BFloat16 support to AnyPrecision optimizer (self.verify_bfloat_support()). This happens at optimizer init if any of the relevant states are...
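The init-time check described above can be sketched in plain Python. This is a minimal sketch of the idea only: the function name, the `state_dtypes` mapping, and the `device_supports_bf16` flag are stand-ins (in the actual optimizer the flag would presumably come from something like `torch.cuda.is_bf16_supported()`), not the PR's real signature:

```python
def verify_bfloat_support(state_dtypes, device_supports_bf16):
    """Fail fast at optimizer init if any optimizer state is configured
    to use bfloat16 on a device that cannot run it.

    state_dtypes: mapping of state name -> dtype string, e.g. "bfloat16"
    device_supports_bf16: bool; in real code, queried from the device
    """
    # collect the states that request bfloat16
    bf16_states = [name for name, dt in state_dtypes.items() if dt == "bfloat16"]
    if bf16_states and not device_supports_bf16:
        raise ValueError(
            f"bfloat16 requested for {bf16_states}, but the current device "
            "does not support bfloat16"
        )
```

Running the check once at init means the user gets an immediate, descriptive error instead of a cryptic kernel failure on the first optimizer step.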

CLA Signed

Enhancement (credit to @rohan-varma): "this can be done in a follow up PR, but let's maybe consider not defaulting things to torch.bfloat16 eventually. this is because it might be good...

enhancement

Problem - if the user runs the AnyPrecision optimizer with Kahan summation and checkpoints the model/optimizer, restarting training may start with an empty compensation buffer. This is not a blocking problem, but...
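One way to handle the restart case is to re-create any missing compensation buffer as zeros when the checkpoint is loaded. This is a hedged, torch-free sketch of that idea, not the optimizer's actual restore path; `restore_kahan_state`, `param_numels`, and the `"compensation"` key are illustrative names:

```python
def restore_kahan_state(param_numels, loaded_state):
    """Ensure every parameter has a Kahan compensation buffer after a
    checkpoint restore; recreate a zeroed buffer where one is missing.

    param_numels: mapping param name -> number of elements
    loaded_state: mapping param name -> state dict that may lack "compensation"
    """
    for name, numel in param_numels.items():
        state = loaded_state.setdefault(name, {})
        comp = state.get("compensation")
        if comp is None or len(comp) != numel:
            # zeros are the natural Kahan starting point: the first steps
            # after restart lose a little accumulated precision, but
            # training proceeds correctly
            state["compensation"] = [0.0] * numel
    return loaded_state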

enhancement

Using the latest nightly (1109) and running on an H100 server: running tests/local_test_c10d.py results in the final tensor comparison failing with a 16% mismatch (appears to be rounding; the largest diff is 0.0097)...

PRs for tau are failing due to an unrelated missing rule for DTensor: "Operator aten.fill.Scalar does not have a DistributedTensor rule registered." Details: Traceback (most recent call last): File "/__w/tau/tau/test/spmd/tensor/test_dtensor_ops.py",...

Currently we run fusion based on an integer policy, where the integer maps to the total number of comm calls to fuse. Need to add a bucket-size policy handler to setup...
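A bucket-size policy replaces "fuse every N calls" with "fuse until a byte budget is hit", similar in spirit to DDP's `bucket_cap_mb`. A minimal greedy sketch, assuming the comm calls are visited in order and `comm_sizes` holds each call's payload in bytes (both names are hypothetical):

```python
def bucket_by_size(comm_sizes, bucket_cap_bytes):
    """Greedy bucketing: accumulate consecutive comm calls until adding the
    next one would exceed bucket_cap_bytes, then start a new bucket.
    Returns a list of buckets, each a list of comm-call indices."""
    buckets, current, current_bytes = [], [], 0
    for idx, nbytes in enumerate(comm_sizes):
        if current and current_bytes + nbytes > bucket_cap_bytes:
            buckets.append(current)          # flush the full bucket
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += nbytes
    if current:
        buckets.append(current)              # flush the trailing bucket
    return buckets
```

A size-based policy adapts to the model automatically, whereas an integer count policy produces wildly different buffer sizes depending on tensor shapes.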

Currently we assume all comm calls can be fused with any other comm call (i.e. all use the default process group). This is usually correct, but we need to implement a check of...
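The guard amounts to partitioning the comm calls into runs that share a process group and only fusing within a run. A small sketch, assuming each call is represented as a dict with a (hypothetical) `"process_group"` key:

```python
from itertools import groupby

def fusable_runs(comm_calls):
    """Split an ordered list of comm calls into maximal consecutive runs
    that share the same process group; only calls within one run may be
    fused into a single buffer."""
    return [
        list(run)
        for _, run in groupby(comm_calls, key=lambda call: call["process_group"])
    ]
```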

Currently we default to FP32 for the fusion buffer, but that is not correct for mixed-precision cases. Thus, we need to check the shape-prop metadata and build the buffer with the correct dtype.
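The dtype selection could look like the following sketch. It is deliberately torch-free (dtypes as strings, a hand-rolled width table) purely to illustrate the policy; real code would read dtypes from the FX shape-prop metadata and could use `torch.promote_types` instead of the widest-wins rule assumed here:

```python
# bit widths for the dtypes we expect in shape-prop metadata (assumption)
_WIDTH = {"float16": 16, "bfloat16": 16, "float32": 32, "float64": 64}

def fusion_buffer_dtype(node_dtypes):
    """Choose the fusion-buffer dtype from the dtypes recorded in
    shape-prop metadata instead of hardcoding FP32: if every fused tensor
    shares one dtype, use it; otherwise fall back to the widest dtype so
    no operand loses precision in the shared buffer."""
    unique = set(node_dtypes)
    if len(unique) == 1:
        return unique.pop()
    return max(unique, key=lambda dt: _WIDTH[dt])
```

With this policy a pure-bf16 graph gets a bf16 buffer (halving comm volume versus the FP32 default), while genuinely mixed graphs stay safe.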