returnn
returnn copied to clipboard
Torch gradient_checkpoint_scope could trigger segmentation fault?
I just saw this in the CI (at commit d5b954b8f6e4c84ec2c289733590e1bf4154ba8b):
============================= test session starts ==============================
platform linux -- Python 3.10.[14](https://github.com/rwth-i6/returnn/actions/runs/9909690500/job/27378323845#step:7:15), pytest-8.2.2, pluggy-1.5.0
rootdir: /home/runner/work/returnn/returnn
configfile: pytest.ini
collected 2 items
tests/test_torch_util.py .. [100%]
=============================== warnings summary ===============================
tests/test_torch_util.py::test_gradient_checkpoint_scope
/home/runner/work/returnn/returnn/tests/test_torch_util.py:[15](https://github.com/rwth-i6/returnn/actions/runs/9909690500/job/27378323845#step:7:16)1: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
torch.testing.assert_allclose(param_post_state[k], param_post_state_[k])
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================= 2 passed, 1 warning in 1.65s =========================
/home/runner/work/_temp/f14cefc5-56ba-4a81-9[17](https://github.com/rwth-i6/returnn/actions/runs/9909690500/job/27378323845#step:7:18)0-4e80a8ecf45f.sh: line 2: [19](https://github.com/rwth-i6/returnn/actions/runs/9909690500/job/27378323845#step:7:20)90 Segmentation fault (core dumped) python -m pytest tests/test_$TEST.py
Error: Process completed with exit code 139.
So tests ran through but at the exit, we got some segmentation fault. Maybe the gradient scope was cleaned up at that late point?