Job runner freezes when a step fails

Open pinkwah opened this issue 1 year ago • 4 comments

In the user's run, the long-running forward model step failed with SIGSEGV, while the previous step was still reported as "running" with a runtime of -1 days. Ert still considers the ensemble to be running, even though every realisation has clearly failed.

https://web.yammer.com/main/org/statoil.com/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMjY2Njc3MTE5ODYwNzM2MCJ9?trk_copy_link=V2_HTML

pinkwah avatar Feb 26 '24 10:02 pinkwah

We need to get the logs (both ert-logs and job-runner-logs) for this one.

xjules avatar Feb 28 '24 13:02 xjules

I would like to do this together with someone, but it is vinterferie (winter break), so most people are WFH or WFC (work from cabin). Will look closer at it on Monday.

jonathan-eq avatar Mar 01 '24 07:03 jonathan-eq

Unassigning myself, as other bugs got priority over this one.

jonathan-eq avatar Mar 06 '24 13:03 jonathan-eq

I think this is going to be fixed by the solution to https://github.com/equinor/ert/issues/4396 but that requires a refactor of the "snapshot" code.

pinkwah avatar Mar 07 '24 08:03 pinkwah

How to reproduce: increase MAX_SUBMIT and make a forward model step segfault at random, e.g. by having the process send a kill signal to itself.

xjules avatar Jun 17 '24 10:06 xjules

Tried poly-case with:

  • LSF
  • poly_eval.py modified to segfault with some probability
  • MAX_SUBMIT set to 2
  • Komodo 2024.06.07 (ert 10.1)
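
For anyone repeating this, the "segfault with some probability" modification might look roughly like the sketch below. This is not the actual change made to poly_eval.py (the thread does not show it); the helper name `maybe_segfault` and its parameters are made up for illustration, and it assumes that sending SIGSEGV to the process itself is close enough to a real segfault for the purposes of this bug.

```python
import os
import random
import signal


def maybe_segfault(probability, rng=random.random, kill=os.kill):
    """Send SIGSEGV to the current process with the given probability.

    Hypothetical helper, not part of ert or the poly example.
    `rng` and `kill` are injectable so the logic can be checked
    without actually terminating the interpreter.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if rng() < probability:
        # With the real os.kill and the default signal handler, the
        # process dies here, much like a genuine segmentation fault.
        kill(os.getpid(), signal.SIGSEGV)
        return True  # only reached when `kill` is a stand-in
    return False


# Intended use: call maybe_segfault(p) near the top of the forward model
# script, before its normal work. The value of p is an assumption; the
# thread does not say which probability was used.
```

With MAX_SUBMIT set to 2, some realisations should then fail once and be resubmitted, and some should fail on both attempts, which is the situation the original report describes.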

Ran ensemble_experiment and carefully watched the "Running time" and the progress bars.

Could not observe anything wrong.

berland avatar Jul 01 '24 11:07 berland