Improve ambiguous logging when max_submit is 1
Currently, even though MAX_SUBMIT is set to 1, we log the failure as "failed after reaching max submit", which is ambiguous since the job was only submitted once. We should provide a more detailed explanation in job.handle_failure.
Also, job._callback_status_msg might be empty, in which case nothing follows the colon in the output:
Realization: 29 failed after reaching max submit (1):
Realization: 44 failed after reaching max submit (1):
Realization: 30 failed after reaching max submit (1):
Suggestions for what to log:
- exit code of the last successful / failed job
- the last known state
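A minimal sketch of how such a message could be assembled. The function name, its parameters, and the exact wording are assumptions for illustration only; this is not the existing job.handle_failure code, and the real job object may expose exit code and state differently.

```python
from typing import Optional


def format_failure_message(
    iens: int,
    max_submit: int,
    exit_code: Optional[int] = None,
    last_state: Optional[str] = None,
    callback_status_msg: str = "",
) -> str:
    """Build a failure line that always carries some detail after the colon.

    Hypothetical helper: all argument names are illustrative assumptions.
    """
    if max_submit == 1:
        # With a single allowed submit there was no retry, so say so explicitly.
        head = (
            f"Realization: {iens} failed "
            "(MAX_SUBMIT is 1, so it was not resubmitted)"
        )
    else:
        head = f"Realization: {iens} failed after reaching max submit ({max_submit})"

    details = [
        f"exit code: {exit_code if exit_code is not None else 'unknown'}",
        f"last known state: {last_state or 'unknown'}",
    ]
    status = callback_status_msg.strip()
    # Fall back to an explicit note instead of leaving the message empty.
    details.append(status if status else "no status message was reported")
    return head + ": " + "; ".join(details)
```

Called with only a realization number and max_submit=1, this still produces a line with the exit code and state marked as unknown, rather than a line ending in a bare colon.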
Apparently, when mimicking NFS syncing issues, we do not get any logs either. Copied from @eivindjahren's message:
If _ert_forward_model_runner crashes (for instance, due to a missing jobs.json because of NFS sync issues), you get no indication of what happened, just the empty failure message:
Realization: 44 failed after reaching max submit (1):
You can reproduce it by injecting a fault, namely not writing the jobs.json file:
--- a/src/ert/enkf_main.py
+++ b/src/ert/enkf_main.py
@@ -231,7 +231,7 @@ def create_run_path(
run_context.iteration,
)
- json.dump(forward_model_output, fptr)
+ # json.dump(forward_model_output, fptr)
run_context.runpaths.write_runpath_list(
[run_context.iteration], run_context.active_realizations
class LegacyEnsemble(Ensemble):
@@ -226,7 +227,7 @@ async def _evaluate_inner( # pylint: disable=too-many-branches
self.min_required_realizations if self.stop_long_running else 0
)
- queue.add_dispatch_information_to_jobs_file()
+ # queue.add_dispatch_information_to_jobs_file()
result = await queue.execute(min_required_realizations)
except Exception:
The logging might already be fixed by 50a4421; this just needs to be tested.
What we should do is detect that the job never ran and get the LSF stdout into the logs.
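One possible shape for that check, as a rough sketch only. It assumes that a started job leaves some marker file in its run path and that the scheduler's stdout ends up in a file there; both file names below are hypothetical placeholders, not names confirmed by this issue, and the helper itself does not exist in the code base.

```python
from pathlib import Path


def describe_job_that_never_ran(
    run_path: Path,
    marker_name: str = "status.json",  # hypothetical marker written once the job starts
    scheduler_stdout_name: str = "lsf.stdout",  # hypothetical location of the LSF stdout
    max_lines: int = 20,
) -> str:
    """Return text for the log if the job never started, else an empty string."""
    if (run_path / marker_name).exists():
        # The job started, so this helper has nothing to report.
        return ""

    stdout_file = run_path / scheduler_stdout_name
    if not stdout_file.is_file():
        return (
            f"Job in {run_path} never started and "
            f"no {scheduler_stdout_name} was found"
        )

    # Keep only the tail so that a large scheduler log does not flood the output.
    lines = stdout_file.read_text(encoding="utf-8", errors="replace").splitlines()
    tail = "\n".join(lines[-max_lines:])
    return f"Job in {run_path} never started; last scheduler output:\n{tail}"
```

The returned text could then be appended to the failure message so that an otherwise empty line carries the scheduler's own explanation.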