
`test_networks::test_checkpointing_thunderfx` fails on (G)B200 due to grads mismatch

Open crcrpar opened this issue 2 months ago • 3 comments

🐛 Bug

test_networks.py::test_checkpointing_thunderfx fails because the gradients computed with thunderfx do not match those computed with eager PyTorch.

To Reproduce

Steps to reproduce the behavior:

  1. Run test_networks.py::test_checkpointing_thunderfx
  2. Observe the gradient mismatch, e.g.:
>       assert_close(grads_res, grads_ref, atol=1e-3, rtol=1e-3)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 62 / 20480 (0.3%)
E       Greatest absolute difference: 9818.546875 at index (1, 7) (up to 0.001 allowed)
E       Greatest relative difference: 22.85102081298828 at index (1, 6) (up to 0.001 allowed)
E
E       The failure occurred for item [1]
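
For readers interpreting the numbers above: `torch.testing.assert_close` treats an element as mismatched when `|actual - expected| > atol + rtol * |expected|`. The following is a minimal pure-Python sketch of that criterion, not the code used by the test; the helper name `mismatch_report` and the sample values are hypothetical and only illustrate why the greatest absolute and greatest relative differences can be reported at different indices.

```python
def mismatch_report(actual, expected, atol=1e-3, rtol=1e-3):
    """Summarize how two flat sequences of floats differ.

    Hypothetical helper: returns (num_mismatched, greatest_abs_diff,
    greatest_rel_diff), mirroring the fields in the error message.
    """
    if len(actual) != len(expected):
        raise ValueError("sequences must have the same length")
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        # Closeness criterion: |actual - expected| <= atol + rtol * |expected|
        if abs_diff > atol + rtol * abs(e):
            mismatched += 1
        max_abs = max(max_abs, abs_diff)
        # Relative difference is only defined for non-zero expected values.
        if e != 0:
            max_rel = max(max_rel, abs_diff / abs(e))
    return mismatched, max_abs, max_rel


# Made-up values for illustration; these are not the gradients from the failing test.
grads_ref = [1.0, 2.0, 400.0, 0.5]
grads_res = [1.0005, 2.0, 410.0, 0.5]
num_bad, max_abs, max_rel = mismatch_report(grads_res, grads_ref)
print(num_bad, max_abs, max_rel)
```

With `atol=1e-3` and `rtol=1e-3`, a small deviation on a value near 1 stays within tolerance, while an absolute difference in the thousands, as in the report above, cannot be explained by rounding noise.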

Expected behavior

Environment

pjnl-20250926

Additional context

  • The test case appears to have been failing since mid-August (B200: 8/13, GB200: 8/21).
  • The PyTorch checkpointing function itself seems stable, judging from the file's commit history: https://github.com/pytorch/pytorch/commits/viable/strict/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py

crcrpar, Sep 26 '25 18:09