Torch gradient_checkpoint_scope could trigger segmentation fault?

Open albertz opened this issue 7 months ago • 16 comments

I just saw this in the CI (https://github.com/rwth-i6/returnn/actions/runs/9909690500/job/27378323845, at commit d5b954b8f6e4c84ec2c289733590e1bf4154ba8b):

============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/runner/work/returnn/returnn
configfile: pytest.ini
collected 2 items

tests/test_torch_util.py ..                                              [100%]

=============================== warnings summary ===============================
tests/test_torch_util.py::test_gradient_checkpoint_scope
  /home/runner/work/returnn/returnn/tests/test_torch_util.py:151: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
    torch.testing.assert_allclose(param_post_state[k], param_post_state_[k])

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================= 2 passed, 1 warning in 1.65s =========================
/home/runner/work/_temp/f14cefc5-56ba-4a81-9170-4e80a8ecf45f.sh: line 2:  1990 Segmentation fault      (core dumped) python -m pytest tests/test_$TEST.py
Error: Process completed with exit code 139.

So the tests ran through (2 passed), but at process exit we got a segmentation fault (exit code 139). Maybe the gradient checkpoint scope was only cleaned up at that late point, during interpreter shutdown?
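One way to probe this hypothesis could look like the sketch below. It uses only the Python standard library: `faulthandler` to get a Python-level traceback if the process segfaults, and `weakref` plus `gc.collect()` to check whether a scope object is already freed before shutdown. `DummyScope` and `scope_is_freed_after_use` are hypothetical names for illustration, not RETURNN's `gradient_checkpoint_scope`; the real scope would have to be substituted in.

```python
import faulthandler
import gc
import sys
import weakref


class DummyScope:
    """Hypothetical stand-in for a scope object (not RETURNN's real class)."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return False  # do not swallow exceptions


def scope_is_freed_after_use() -> bool:
    """Return True if the scope object is gone once we drop it and collect garbage."""
    scope = DummyScope()
    ref = weakref.ref(scope)  # weak reference does not keep the object alive
    with scope:
        pass
    del scope
    gc.collect()  # force cleanup now instead of at interpreter shutdown
    return ref() is None


if __name__ == "__main__":
    # Dump a Python-level traceback to stderr if the process receives SIGSEGV.
    faulthandler.enable(file=sys.stderr)
    print("scope freed before exit:", scope_is_freed_after_use())
```

If the real scope is still alive after `gc.collect()`, something keeps a reference to it until exit, which would be consistent with the late-cleanup guess. This is only a diagnostic sketch and does not confirm the cause.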

albertz, Jul 12 '24 14:07