stderr from job_dispatch is lost when runpath is missing
Describe the bug
If `job_dispatch.py` fails on a cluster node because the runpath does not exist, it fails with `sys.exit` and a "No such directory" message, but this error is not conveyed to the ERT user on the master node. The error goes to stderr in the LSF system, which is normally dumped to the runpath, but that cannot happen since the runpath does not exist.
This situation can arise either due to user errors, or if the filesystem on the cluster is misconfigured or flaky.
To Reproduce
- Set the queue driver to `LSF` in `snake_oil.ert`.
- Run `mkdir /tmp/my-runpath` on your master node (this constitutes a "user error", since `/tmp` is local to the master node).
- Set `RUNPATH /tmp/my-runpath/realization-%d` in `snake_oil.ert`.
- Start `ert gui snake_oil.ert`.
- Run an ensemble experiment; one realization is enough (but not a test run, as we need the cluster).
- Observe that all realizations fail, but you will find no hint of what the actual problem is.
Expected behavior
stdout/stderr from LSF should be conveyed to the user in some way.
Environment
- OS: RHEL7
- ERT/Komodo Release: 2022.05
- Python version: 3.8
- Remote/HPC execution involved: yes
Additional context
The cluster job probably gets to this line:
https://github.com/equinor/ert/blob/237024c221a415a9300b658320b68d9fcddc12db/job_runner/cli.py#L57
and then it exits. It would be natural to expect `job_dispatch` to call home with the error message in this situation, but that is impossible: the information needed to call home is loaded from the runpath in the lines that follow.
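For illustration, a hypothetical sketch of this failure mode (the function name and exact message are assumptions, loosely mirroring the linked `cli.py` logic, not the actual ERT code):

```python
import os
import sys


def enter_runpath(run_path: str) -> None:
    """Hypothetical sketch: chdir into the runpath before dispatching
    jobs, exiting with a message if the runpath does not exist."""
    if not os.path.isdir(run_path):
        # On the compute node this message goes to stderr. LSF normally
        # dumps stderr to a file inside the runpath, but the runpath
        # does not exist, so the message never reaches the ERT user.
        sys.exit(f"No such directory: {run_path}")
    os.chdir(run_path)
```

At this point nothing from the runpath has been loaded yet, which is why calling home with the error is not an option.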
But we write a lot of files in the runpath prior to running `job_dispatch.py`. So why isn't this caught earlier?
These files are written by the master node, which in this scenario has access to the runpath.
Some potential solutions:
- Have LSF return the stderr to the client. Check the LSF documentation to see if this is possible at all. This would be the preferred way, as we would be able to catch more errors down the line. EDIT: Was not able to get this to work. However, we found a command "qpeek" which did exactly what we want here, but it removes the logs after completing.
- Detect when the runpath is not on a shared disk while running on a non-local queue, and output a warning if so. EDIT: This requires separate commands for every OS, and we will only be able to get the format of the filesystem, not whether or not it is shared.
Putting this back to Todo, as it is an improvement mostly for us developers, and we have not found a good way of implementing it. We could hard-code a check for whether the runpath starts with `/tmp`, but that is bad practice.
Solving #7694 might help this issue.
The `bsub` command accepts a `-e` option that redirects stderr to a file.
Testing:

```
[be-linrgsn001:/tmp/foo]$ bsub -o stdout -e stderr "echo yay > /tmp/foo/yayfile"; echo $?
bash: line 0: cd: /tmp/foo: No such file or directory
Job <10822> is submitted to default queue <normal>.
0
```
The "No such file or directory" stems from line 101 in `/global/bin/bsub`, but that error/return code is ignored:

```
$RSH $LSFSRV "cd $MYPWD;/global/bin/$PROG $ARGS"
```
There is nothing in `/tmp/foo/stderr`, where we would like a "No such file or directory" to be present.
One could think that the wrapping provided by `/global/bin/bsub` is perturbing this. However, running `/global/bin/bsub` on an LSF grid server with the CWD set to `/tmp/something` reveals that the redirection to stderr happens on the compute-node side.
There might be a possibility of parsing "Execution CWD" from `bjobs -l <jobid>` after a realization finishes (with failure). Execution CWD is set to the user's home directory if the requested CWD does not exist, so if there is a mismatch, a warning could be issued.
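The mismatch check above could be sketched as follows (hypothetical code; it assumes the field appears as `Execution CWD <path>` in the `bjobs -l` output, which should be verified against the actual LSF version in use):

```python
import re


def execution_cwd_mismatch(bjobs_output: str, requested_cwd: str):
    """Hypothetical check: compare the 'Execution CWD' reported by
    `bjobs -l <jobid>` against the runpath ERT requested.

    Returns True on mismatch (runpath likely missing on the compute
    node), False on match, and None if the field is absent."""
    match = re.search(r"Execution CWD <([^>]+)>", bjobs_output)
    if match is None:
        return None  # field not present; cannot decide
    return match.group(1) != requested_cwd
```

If the reported Execution CWD turns out to be the user's home directory rather than the requested runpath, a warning could then be issued to the user.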