stderr from job_dispatch is lost when runpath is missing
Describe the bug
If `job_dispatch.py` fails on a cluster node because the runpath does not exist, it fails with `sys.exit` and a "No such directory" message, but this error is not conveyed to the ERT user on the master node. The error goes to stderr in the LSF system, which is normally dumped to the runpath, but that cannot happen since the runpath does not exist.
This situation can arise either due to user errors, or if the filesystem on the cluster is misconfigured or flaky.
To Reproduce
- Set the queue driver to `LSF` in `snake_oil.ert`.
- Run `mkdir /tmp/my-runpath` on your master node (this constitutes a "user error", since `/tmp` is local to the master node).
- Set `RUNPATH /tmp/my-runpath/realization-%d` in `snake_oil.ert`.
- Start `ert gui snake_oil.ert`.
- Run an ensemble experiment; one realization is enough (but not a test run, as we need the cluster).
- Observe that all realizations fail, but you will find no hint of what the actual problem is.
Expected behavior
stdout/stderr from LSF should be conveyed to the user in some way.
Environment
- OS: RHEL7
- ERT/Komodo Release: 2022.05
- Python version: 3.8
- Remote/HPC execution involved: yes
Additional context
The cluster job probably gets to this line:
https://github.com/equinor/ert/blob/237024c221a415a9300b658320b68d9fcddc12db/job_runner/cli.py#L57
and then it exits. It would be natural to expect `job_dispatch` to call home with the error message in this situation, but that is impossible: the information needed to call home is loaded from the runpath in the lines that follow.
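For illustration, a hypothetical sketch of this failure mode (the function name and exact message are assumptions, loosely mirroring the linked `cli.py` logic, not the actual ERT code):

```python
import os
import sys


def enter_runpath(run_path: str) -> None:
    """Hypothetical sketch: chdir into the runpath before dispatching
    jobs, exiting with a message if the runpath does not exist."""
    if not os.path.isdir(run_path):
        # On the compute node this message goes to stderr. LSF normally
        # dumps stderr to a file inside the runpath, but the runpath
        # does not exist, so the message never reaches the ERT user.
        sys.exit(f"No such directory: {run_path}")
    os.chdir(run_path)
```

At this point nothing from the runpath has been loaded yet, which is why calling home with the error is not an option.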
But we write a lot of files in the runpath prior to running `job_dispatch.py`. So why isn't this caught earlier?
These files are written by the master node, which in this scenario has access to the runpath.
Some potential solutions:
- Have LSF return the stderr to the client. Check the LSF documentation to see if this is possible at all. This would be the preferred way, as we would be able to catch more errors down the line. EDIT: Was not able to get this to work. However, we found a command "qpeek" which did exactly what we want here, but it removes the logs after completing.
- Detect when the runpath is not on a shared disk while running on a non-local queue, and output a warning if so. EDIT: This requires separate commands for every OS, and we will only be able to get the format of the filesystem, not whether or not it is shared.
Putting this back to Todo, as it is an improvement mostly for us developers, and we have not found a good way of implementing it. We could hard-code a check for whether the runpath starts with `/tmp`, but that is bad practice.
Solving #7694 might help this issue.
The `bsub` command accepts a `-e` option that redirects stderr to a file.
Testing:

```
[be-linrgsn001:/tmp/foo]$ bsub -o stdout -e stderr "echo yay > /tmp/foo/yayfile"; echo $?
bash: line 0: cd: /tmp/foo: No such file or directory
Job <10822> is submitted to default queue <normal>.
0
```
The "No such file or directory" stems from line 101 in `/global/bin/bsub`, but that error/return code is ignored:

```
$RSH $LSFSRV "cd $MYPWD;/global/bin/$PROG $ARGS"
```
There is nothing in `/tmp/foo/stderr`, where we would like a "No such file or directory" to be present.
One could think that the wrapping provided by `/global/bin/bsub` is perturbing this. However, running `/global/bin/bsub` on an LSF grid server with the CWD set to `/tmp/something` reveals that the redirection to stderr happens on the compute-node side.
There might be a possibility of parsing "Execution CWD" from `bjobs -l <jobid>` after a realization finishes (with failure). Execution CWD is set to the user's home directory if the requested CWD does not exist, so if there is a mismatch, a warning could be issued.
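The mismatch check above could be sketched as follows (hypothetical code; it assumes the field appears as `Execution CWD <path>` in the `bjobs -l` output, which should be verified against the actual LSF version in use):

```python
import re


def execution_cwd_mismatch(bjobs_output: str, requested_cwd: str):
    """Hypothetical check: compare the 'Execution CWD' reported by
    `bjobs -l <jobid>` against the runpath ERT requested.

    Returns True on mismatch (runpath likely missing on the compute
    node), False on match, and None if the field is absent."""
    match = re.search(r"Execution CWD <([^>]+)>", bjobs_output)
    if match is None:
        return None  # field not present; cannot decide
    return match.group(1) != requested_cwd
```

If the reported Execution CWD turns out to be the user's home directory rather than the requested runpath, a warning could then be issued to the user.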