ert Job runner freezes when a step fails

In the user's run, the long-running forward model step failed with SIGSEGV, while the previous step is still reported as "running" with a runtime of -1 days. Ert considers the ensemble to still be running even though every realisation has evidently failed.

https://web.yammer.com/main/org/statoil.com/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMjY2Njc3MTE5ODYwNzM2MCJ9?trk_copy_link=V2_HTML

Feb 26 '24 10:02 pinkwah

We need to get the logs (both ert-logs and job-runner-logs) for this one.

Feb 28 '24 13:02 xjules

I would like to do this together with someone, but it is winter break (vinterferie), so most people are WFH or WFC (work from cabin). Will look closer at it on Monday.

Mar 01 '24 07:03 jonathan-eq

Unassigning myself, as other bugs got priority over this one.

Mar 06 '24 13:03 jonathan-eq

I think this is going to be fixed by the solution to https://github.com/equinor/ert/issues/4396 but that requires a refactor of the "snapshot" code.

Mar 07 '24 08:03 pinkwah

How to reproduce: increase MAX_SUBMIT and make a forward model step segfault at random, e.g. by having the process send itself a signal that kills it.
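
A minimal sketch of the random-segfault part, assuming the forward model step is a Python script. The names `should_crash` and `maybe_segfault` and the default probability are hypothetical, chosen for illustration, and not part of ert:

```python
import os
import random
import signal


def should_crash(probability: float) -> bool:
    """Return True with the given probability (0.0 = never, 1.0 = always)."""
    # random.random() is in [0.0, 1.0), so 0.0 never fires and 1.0 always does.
    return random.random() < probability


def maybe_segfault(probability: float = 0.3) -> None:
    """Kill the current process with SIGSEGV with the given probability.

    Assumption: the default SIGSEGV handler is in place, so the process
    terminates as it would after a real segmentation fault.
    """
    if should_crash(probability):
        os.kill(os.getpid(), signal.SIGSEGV)
```

Calling `maybe_segfault()` near the top of the forward model script should make some realisations die with SIGSEGV, so that with MAX_SUBMIT above 1 some of them get resubmitted.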

Jun 17 '24 10:06 xjules

Tried poly-case with:

LSF
poly_eval.py modified to segfault with some probability
MAX_SUBMIT set to 2
Komodo 2024.06.07 (ert 10.1)

Ran ensemble_experiment and carefully watched the "Running time" and the progress bars.

Could not observe anything wrong.

Jul 01 '24 11:07 berland