
Improve ambiguous logging when max_submit is 1

Open xjules opened this issue 1 year ago • 5 comments

Currently, even though MAX_SUBMIT is set to 1, we log the failure as "failed after reaching max submit", which is ambiguous. We should provide a more detailed explanation in job.handle_failure. Also, job._callback_status_msg might be empty, which produces empty output:

Realization: 29 failed after reaching max submit (1):
	
Realization: 44 failed after reaching max submit (1):
	
Realization: 30 failed after reaching max submit (1):
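
One way the message could be made less ambiguous is sketched below. This is only an illustration, not ert's actual code: `format_failure_message` and its parameters are hypothetical names, and the wording of the messages is an assumption. It words the single-attempt case differently and never leaves the detail line empty.

```python
def format_failure_message(iens: int, max_submit: int, callback_status_msg: str) -> str:
    """Build a log line for a realization that has used up its submit attempts.

    Hypothetical helper: the name, parameters, and wording are illustrative.
    """
    if max_submit == 1:
        # With a single attempt there was no resubmission, so say so plainly.
        headline = f"Realization: {iens} failed on its only submit attempt (MAX_SUBMIT is 1)"
    else:
        headline = f"Realization: {iens} failed after reaching max submit ({max_submit})"
    # Never emit an empty detail line; state explicitly that nothing was reported.
    detail = callback_status_msg.strip() or "no status message was reported for this failure"
    return f"{headline}:\n\t{detail}"
```

Calling `format_failure_message(29, 1, "")` then yields a headline that mentions the single attempt, followed by an explicit note that no status message was available.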

xjules avatar Apr 24 '24 11:04 xjules

Suggestions for what to log:

  • the exit code of the last successful / failed job
  • the last known state
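
The two items above could be carried in a small record and rendered into the log line, roughly as follows. This is a sketch under assumptions: `FailureDetails`, its fields, and the state names are hypothetical and are not taken from ert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureDetails:
    """Hypothetical container for what the suggestion proposes to log."""

    iens: int
    last_state: str  # last known state, e.g. "RUNNING" (state names are assumed)
    exit_code: Optional[int] = None  # None when no exit code was ever observed

    def describe(self) -> str:
        # Distinguish "no exit code seen" from an actual exit code of 0.
        if self.exit_code is None:
            exit_part = "exit code unknown"
        else:
            exit_part = f"exit code {self.exit_code}"
        return f"Realization: {self.iens} last known state: {self.last_state}, {exit_part}"
```

Keeping the exit code optional matters here, because a job that never started has no exit code to report.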

xjules avatar Apr 25 '24 11:04 xjules

Apparently, when mimicking NFS syncing issues, we do not get any logs either. Copied from @eivindjahren's message:


If _ert_forward_model_runner crashes (for instance due to missing jobs.json because of NFS sync issues) then you get no indication of what happened. Just the empty failure message:

Realization: 44 failed after reaching max submit (1):

You can reproduce it by injecting a fault that skips writing the jobs.json file:

--- a/src/ert/enkf_main.py
+++ b/src/ert/enkf_main.py
@@ -231,7 +231,7 @@ def create_run_path(
                     run_context.iteration,
                 )
 
-                json.dump(forward_model_output, fptr)
+                # json.dump(forward_model_output, fptr)
 
     run_context.runpaths.write_runpath_list(
         [run_context.iteration], run_context.active_realizations
 class LegacyEnsemble(Ensemble):
@@ -226,7 +227,7 @@ async def _evaluate_inner(  # pylint: disable=too-many-branches
                 self.min_required_realizations if self.stop_long_running else 0
             )
 
-            queue.add_dispatch_information_to_jobs_file()
+            # queue.add_dispatch_information_to_jobs_file()
             result = await queue.execute(min_required_realizations)
 
         except Exception:

xjules avatar Apr 29 '24 12:04 xjules

The logging might already be fixed by 50a4421. We just need to test it.

xjules avatar May 02 '24 13:05 xjules

> The logging might be already fixed by 50a4421. Need to just test.

It did not fix it.

jonathan-eq avatar May 03 '24 08:05 jonathan-eq

What we should do is detect that the job does not run and get the LSF stdout into the logs.
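
A rough sketch of the second half of that idea, copying the tail of a scheduler stdout file into the log, might look like this. It assumes the scheduler output is available as a plain file; `log_scheduler_stdout`, the path argument, and the 20-line default are hypothetical, and nothing here reflects how ert or LSF actually expose that output.

```python
import logging
from collections import deque
from pathlib import Path

logger = logging.getLogger(__name__)


def log_scheduler_stdout(stdout_path: Path, iens: int, max_lines: int = 20) -> str:
    """Log the last lines of a scheduler stdout file and return what was logged.

    Hypothetical helper: where the file lives is an assumption of this sketch.
    """
    try:
        with stdout_path.open(encoding="utf-8", errors="replace") as fh:
            # A bounded deque keeps only the final max_lines lines of the file.
            tail = "".join(deque(fh, maxlen=max_lines)).rstrip()
    except OSError as err:
        # A missing or unreadable file is itself worth reporting.
        tail = f"could not read {stdout_path}: {err}"
    if not tail:
        tail = f"{stdout_path} is empty"
    logger.error("Realization: %s scheduler output:\n%s", iens, tail)
    return tail
```

Reporting an unreadable or empty file explicitly avoids ending up with another blank failure message.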

xjules avatar May 03 '24 11:05 xjules