
Verify that the selected queue type can be used

Open lars-petter-hauge opened this issue 8 months ago • 1 comments

Describe the bug

ert prints an unhelpful traceback when the cluster runner is misconfigured. It would be nice if ert could check that the selected driver can be used before trying to submit jobs.

To reproduce

Steps to reproduce the behaviour:

  1. Connect to equinor azure node
  2. ert gui my_config.ert
  3. Run experiment (IES/Smoother/ESMDA/Test)

Expected behaviour

A better error message, stating that the selected queue system cannot be used, instead of a traceback.

Screenshots

The following is printed to the terminal once for every qsub call, so at least once per realisation:

Command "/opt/pbs/bin/qsub -rn -Nstress.ert-1 -q short -o /dev/null -e /dev/null -l select=1:ncpus=1" failed with exit code 160, output: "<empty>", and error: "Unknown Host.
qsub: cannot connect to server Please (errno=15008)"
Exception in scheduler task job-1_task: Command "/opt/pbs/bin/qsub -rn -Nstress.ert-1 -q short -o /dev/null -e /dev/null -l select=1:ncpus=1" failed with exit code 160, output: "<empty>", and error: "Unknown Host.
qsub: cannot connect to server Please (errno=15008)"
Traceback: Traceback (most recent call last):
  File "/usr/lib64/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/prog/komodo/2024.06.rc1-py38-rhel8/root/lib64/python3.8/site-packages/_ert/async_utils.py", line 53, in _done_callback
    raise exc
  File "/prog/komodo/2024.06.rc1-py38-rhel8/root/lib64/python3.8/site-packages/ert/scheduler/job.py", line 131, in run
    await self._submit_and_run_once(sem)
  File "/prog/komodo/2024.06.rc1-py38-rhel8/root/lib64/python3.8/site-packages/ert/scheduler/job.py", line 99, in _submit_and_run_once
    await self.driver.submit(
  File "/prog/komodo/2024.06.rc1-py38-rhel8/root/lib64/python3.8/site-packages/ert/scheduler/openpbs_driver.py", line 214, in submit
    raise RuntimeError(process_message)
RuntimeError: Command "/opt/pbs/bin/qsub -rn -Nstress.ert-1 -q short -o /dev/null -e /dev/null -l select=1:ncpus=1" failed with exit code 160, output: "<empty>", and error: "Unknown Host.
qsub: cannot connect to server Please (errno=15008)"

Environment

  • ERT/Komodo release: Any
  • Remote/HPC execution involved: yes

Additional context

The default PBS server cannot be used; users are expected to set the server themselves. The placeholder text in /etc/pbs.conf is presumably why qsub reports the server name as "Please" in the error above.

$ cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=Please set SERVER_NAME in your environment
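
A minimal sketch of the kind of pre-submission check requested above. It is not ert's implementation: `parse_pbs_conf` and `check_pbs_server` are hypothetical names, and the sketch assumes that a `PBS_SERVER` environment variable takes precedence over the value in pbs.conf and that a usable server is one whose host name resolves.

```python
import os
import socket
from typing import Callable, Dict, Mapping, Optional


def parse_pbs_conf(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines of a pbs.conf-style file, skipping blanks and comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def _resolves(host: str) -> bool:
    """Return True if the host name can be resolved to an address."""
    try:
        socket.gethostbyname(host)
    except (OSError, UnicodeError):
        return False
    return True


def check_pbs_server(
    conf_text: str,
    environ: Optional[Mapping[str, str]] = None,
    resolves: Callable[[str], bool] = _resolves,
) -> Optional[str]:
    """Return a readable error message if the PBS server looks unusable, else None.

    Assumption: PBS_SERVER in the environment overrides the pbs.conf value.
    """
    environ = os.environ if environ is None else environ
    server = environ.get("PBS_SERVER") or parse_pbs_conf(conf_text).get("PBS_SERVER")
    if not server:
        return "No PBS server is configured; set PBS_SERVER before submitting jobs."
    host = server.split(":", 1)[0]  # drop an optional ":port" suffix
    # A host name containing whitespace is a placeholder, not a real server.
    if any(ch.isspace() for ch in host) or not resolves(host):
        return (
            f"PBS_SERVER is set to {server!r}, which is not a reachable host; "
            "set PBS_SERVER before submitting jobs."
        )
    return None
```

Called once before the first qsub with the contents of /etc/pbs.conf, a check like this would turn the configuration above into a single message instead of one traceback per realisation. The `resolves` parameter exists only so the check can be exercised without network access.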

lars-petter-hauge avatar Jun 10 '24 08:06 lars-petter-hauge