job_runner classifies successful forward_model as failure

Open berland opened this issue 2 years ago • 2 comments

Describe the bug

When testing Drogon in Azure, the following failure is observed:

realization-0/iter-0]$ cat ERROR
<error>
  <time>10:23:22</time>
  <job>DESIGN_KW</job>
  <reason>The target file:DESIGN_KW.OK has not been updated; this is flagged as failure. mtime:1653294197.0   stat_start_time:1653294197.0</reason>
  <stderr>
<Not written by:DESIGN_KW>
</stderr>
</error>

In this case, DESIGN_KW successfully created the file as it should, but it looks like the filesystem is too fast for the mtime check at: https://github.com/equinor/ert/blob/6d118d1500aa7daa4573782406adb78e78c22e16/job_runner/job.py#L242

Looking at the status.json, the previous design_kw forward models seem to have completed within ~1ms.
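
The failure mode described above can be sketched as follows. This is a minimal illustration, not the actual ert code at the linked line: the helper `target_updated` and its strict `>` comparison are assumptions, and `os.utime` is used only to imitate a filesystem that reports whole-second timestamps.

```python
import os
import tempfile


def target_updated(path: str, mtime_before: float) -> bool:
    """Hypothetical check: the target counts as updated only if its mtime moved."""
    return os.stat(path).st_mtime > mtime_before


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "DESIGN_KW.OK")

    # A target file left over from an earlier run.
    with open(target, "w") as fh:
        fh.write("old run\n")
    # Pin the timestamp to a whole second to imitate coarse resolution.
    os.utime(target, (1653294197, 1653294197))
    mtime_before = os.stat(target).st_mtime

    # The "job" rewrites the file within the same second.
    with open(target, "w") as fh:
        fh.write("new run\n")
    os.utime(target, (1653294197, 1653294197))

    # The file was rewritten, yet the mtime check reports no update.
    false_failure = not target_updated(target, mtime_before)
    print("flagged as failure:", false_failure)
```

Under these assumptions a job that finishes within one timestamp tick is indistinguishable from a job that never touched the file.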

To Reproduce

Steps to reproduce the behavior:

1. Run Drogon in Azure with the Torque driver, here tested with the drogon_design.ert.

Expected behavior

Successful jobs should be classified as such.

Environment

OS: RHEL7
ERT/Komodo Release: 2.35, komodo-stable
Python version: 3.8
Remote/HPC execution involved: yes

May 23 '22 08:05 berland

I see two possible solutions:

- use the nanosecond version, i.e. stat.st_mtime_ns, see https://docs.python.org/3.6/library/os.html#os.stat_result and in particular the note about resolution
- use a hash of the file content instead of the mtime (we don't use the time itself, only whether the file has changed)

The latter is more robust and my preference, but may require slightly more processing-time.

I'll cook up a patch but someone else needs to test it in Azure.
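
For illustration, the two options can be compared side by side. This is only a sketch and not the patch itself: `file_digest` is a hypothetical helper, SHA-256 is an arbitrary choice of hash, and `os.utime(..., ns=...)` is used to imitate a rewrite that lands on the same timestamp.

```python
import hashlib
import os
import tempfile


def file_digest(path: str) -> str:
    """Hypothetical helper: SHA-256 of the file content, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "DESIGN_KW.OK")
    stamp_ns = 1653294197 * 10**9  # whole second, expressed in nanoseconds

    with open(target, "w") as fh:
        fh.write("old run\n")
    os.utime(target, ns=(stamp_ns, stamp_ns))
    mtime_ns_before = os.stat(target).st_mtime_ns
    digest_before = file_digest(target)

    # The job rewrites the file, but the timestamp does not advance.
    with open(target, "w") as fh:
        fh.write("new run\n")
    os.utime(target, ns=(stamp_ns, stamp_ns))

    mtime_changed = os.stat(target).st_mtime_ns != mtime_ns_before
    hash_changed = file_digest(target) != digest_before
    print("mtime_ns changed:", mtime_changed)
    print("content hash changed:", hash_changed)
```

Note that st_mtime_ns only helps if the filesystem actually records sub-second timestamps, and a content hash only detects a rewrite when the new bytes differ from the old ones; a job that writes identical content would still look unchanged.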

May 24 '22 06:05 BjarneHerland

Still occurs after merging #3428 :

The target file:DESIGN_KW.OK has not been updated; this is flagged as failure. mtime:1654868393.0 stat_start_time:1654868393000000000
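
One possible reading of the numbers in this message, offered as an observation rather than a diagnosis: mtime is printed in seconds while stat_start_time is printed in nanoseconds, and once converted they denote the same instant with no sub-second part. The variable names below are illustrative only.

```python
# Values copied from the error message above.
mtime_s = 1654868393.0
stat_start_time_ns = 1654868393000000000

# Bring both values to nanoseconds before comparing them.
mtime_ns = int(mtime_s) * 10**9
same_instant = mtime_ns == stat_start_time_ns

# Sub-second remainder of the start time, in nanoseconds.
sub_second_ns = stat_start_time_ns % 10**9

print("same instant:", same_instant)
print("sub-second part (ns):", sub_second_ns)
```

If the timestamps really carry no sub-second information, switching to nanosecond fields would not by itself separate two writes within the same second.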

Jun 10 '22 13:06 berland

Apparently, this issue was solved by removing TARGET_FILE from the job configurations, since the TARGET_FILE check was the problem for smaller jobs when testing whether they had succeeded. For reference: https://github.com/equinor/semeio/issues/431 Closing this one then.

Oct 24 '22 11:10 xjules
