Job runner freezes when a step fails
In the user's run, the long-running forward model step failed with SIGSEGV, yet the preceding step is still reported as "running" with a runtime of -1 days. Ert considers the ensemble to still be running even though every realisation has evidently failed.
https://web.yammer.com/main/org/statoil.com/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMjY2Njc3MTE5ODYwNzM2MCJ9?trk_copy_link=V2_HTML
We need to get the logs (both ert-logs and job-runner-logs) for this one.
I would like to do this together with someone, but it is winter break (vinterferie), so most people are WFH or WFC (work from cabin). Will look closer at it on Monday.
Unassigning myself, as other bugs got priority over this one.
I think this is going to be fixed by the solution to https://github.com/equinor/ert/issues/4396 but that requires a refactor of the "snapshot" code.
How to reproduce: increase MAX_SUBMIT and make the forward model step segfault at random, e.g. by having the process send SIGSEGV to itself.
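A minimal sketch of the "process kills itself" part, assuming a POSIX system. The function names `maybe_segfault` and `exit_code_of_self_killed_child` are hypothetical and are not part of ert or of poly_eval.py. The second helper only illustrates how such a crash looks from the parent's side: Python's subprocess module reports death by signal as a negative return code.

```python
import os
import random
import signal
import subprocess
import sys


def maybe_segfault(probability, rng=random):
    """Send SIGSEGV to the current process with the given probability.

    Hypothetical helper: call it early in a forward model script to
    simulate a random crash.
    """
    if rng.random() < probability:
        os.kill(os.getpid(), signal.SIGSEGV)


def exit_code_of_self_killed_child():
    """Run a child Python process that kills itself with SIGSEGV.

    Returns the child's return code as seen by the parent. On POSIX,
    subprocess reports termination by signal N as -N.
    """
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    result = subprocess.run([sys.executable, "-c", child_code])
    return result.returncode
```

Calling something like `maybe_segfault(0.3)` at the top of the forward model script, with MAX_SUBMIT above 1, should give a mix of realisations that crash and get resubmitted and realisations that succeed.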
Tried poly-case with:
- LSF
- poly_eval.py modified to segfault with some probability
- MAX_SUBMIT set to 2
- Komodo 2024.06.07 (ert 10.1)
Ran ensemble_experiment and carefully watched the "Running time" and the progress bars.
Could not observe anything wrong.