open_lm

grad accum tests failing on gpu w/ amp_bf16 precision

Open sagadre opened this issue 2 years ago • 0 comments

Changing the precision from fp32 to amp_bf16 causes pytest tests/test_grad_accum.py to fail on GPU:

FAILED tests/test_grad_accum.py::test_grad_acc - AssertionError: Failed gradient checks at: ['tok_embeddings.weight', 'layers.0.attention.in_proj.weight', 'layers.0...
FAILED tests/test_grad_accum.py::test_grad_acc_fsdp - torch.multiprocessing.spawn.ProcessRaisedException: 
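
The report does not identify a cause. One plausible factor, offered here only as an assumption, is that bf16 keeps about 8 significant bits, so a gradient computed once on the full batch and the same gradient assembled from rounded micro-batch pieces can differ by more than a tolerance tuned for fp32. The toy sketch below illustrates that effect in pure Python. It does not use open_lm or PyTorch, and `to_bf16` and `accumulation_gap` are hypothetical helper names, not functions from the test suite.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Hypothetical helper: emulates bf16 by keeping the top 16 bits of the
    float32 encoding. Intended for ordinary finite values only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that truncating the low 16 bits rounds to nearest even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) & 0xFFFFFFFF) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulation_gap(per_sample_grads, micro_batch_size, cast):
    """Compare a full-batch mean gradient with an accumulated one.

    `cast` models the precision in which each gradient is produced.
    Accumulation itself happens in full precision (an assumption of
    this toy model, not a statement about open_lm).
    """
    n = len(per_sample_grads)
    full = cast(sum(per_sample_grads) / n)
    accumulated = 0.0
    for start in range(0, n, micro_batch_size):
        chunk = per_sample_grads[start:start + micro_batch_size]
        # Each micro-batch mean is rounded by `cast`, then weighted by its size.
        accumulated += cast(sum(chunk) / len(chunk)) * len(chunk) / n
    return abs(full - accumulated)


grads = [0.1, 0.2, 0.3, 0.7]  # made-up per-sample gradients
gap_fp = accumulation_gap(grads, 2, cast=float)    # no extra rounding
gap_bf16 = accumulation_gap(grads, 2, cast=to_bf16)  # bf16 rounding
print(f"full precision gap: {gap_fp:.3e}")
print(f"bf16 gap:           {gap_bf16:.3e}")
```

In this toy model the bf16 gap is many orders of magnitude larger than the full-precision gap, which is the kind of difference a strict gradient check would flag. Whether that is what actually happens in test_grad_acc is not established by the report.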

sagadre, Dec 19 '23 23:12