open_lm
Grad accum tests failing on GPU with amp_bf16 precision
Changing the precision from fp32 to amp_bf16 causes pytest tests/test_grad_accum.py to fail:
FAILED tests/test_grad_accum.py::test_grad_acc - AssertionError: Failed gradient checks at: ['tok_embeddings.weight', 'layers.0.attention.in_proj.weight', 'layers.0...
FAILED tests/test_grad_accum.py::test_grad_acc_fsdp - torch.multiprocessing.spawn.ProcessRaisedException:
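A likely cause (an assumption, not confirmed from the repo's test code): bf16 keeps only about 8 bits of mantissa, so gradients accumulated across several microbatches round differently than a single full-batch reduction, and exact-equality gradient checks that pass in fp32 will trip under amp_bf16 unless the comparison uses a looser rtol/atol. The sketch below simulates bf16 rounding in pure Python (a bfloat16 is the top 16 bits of a float32) to show the mismatch; the gradient values are made up for illustration.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to bfloat16 precision by truncating a float32 to its
    top 16 bits (rounding mode is simplified; truncation suffices here)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Hypothetical per-sample gradients for one parameter.
grads = [0.1007, -0.2501, 0.3303, 0.0999]

# Full-batch mean gradient, computed in high precision.
full = sum(grads) / len(grads)

# Two-step gradient accumulation, rounding each partial result to
# bf16 as autocast would between microbatches.
half1 = to_bf16(sum(grads[:2]) / len(grads))
half2 = to_bf16(sum(grads[2:]) / len(grads))
accum = to_bf16(half1 + half2)

# Small but nonzero: an exact-equality gradient check fails in bf16.
print(abs(full - accum))
```

If this is the issue, the usual fix is comparing gradients with a dtype-appropriate tolerance (e.g. torch.testing.assert_close, whose default rtol/atol are looser for bf16) rather than exact equality, or running the check itself in fp32.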