Torque driver does not handle non-zero exit code from jobs

Open sondreso opened this issue 3 years ago • 2 comments

Describe the bug The torque driver does not propagate information about failing jobs in the same way as the other drivers. For example, the LSF driver will return JOB_QUEUE_EXIT if the LSF job exited with a non-zero exit code. The torque driver can only produce the following statuses (see also http://docs.adaptivecomputing.com/torque/4-1-3/Content/topics/commands/qstat.htm#standardOutput):

https://github.com/equinor/ert/blob/94a29ddf18ec6f72e94d5cbc6a097c57db67a411/src/libres/lib/job_queue/torque_driver.cpp#L609-L631

This leads to incorrect behavior: ERT attempts to load data even though the job failed, and the realization is marked as successful despite having failed jobs.

To reproduce Can be reproduced by adding a failing job after poly_eval in the poly example and using the torque driver. Data will be loaded and the realization appears successful, even though the last job failed.

Expected behaviour Handle failing jobs the same way as e.g. the LSF driver does, where a job exiting with a non-zero exit code makes the driver propagate JOB_QUEUE_EXIT and not JOB_QUEUE_DONE.

Screenshots image

Environment

  • OS: RHEL7
  • ERT/Komodo release: bleeding (~2.38) on azure
  • Python version: 3.8
  • Remote/HPC execution involved: yes

Additional context N/A

See also: https://github.com/equinor/ert/issues/3781

sondreso avatar Aug 09 '22 11:08 sondreso

Fixing this involves modifying the driver to query the queueing system for each individual job with qstat -f <jobid> and then parsing the output. From man qstat on an Azure node:

       Job Status in Long Format
       Trigger: the -f option.
       If you specify the -f (full) option, full job status information for each job is displayed in this order:
            The job ID
            Each job attribute, one to a line
            The job's submission arguments
            The job's executable, in JSDL format
            The executable's argument list, in JSDL format

Example for a failed job (a python job doing sys.exit(1)):

$ qstat -x -f 4270 | head -n 1
Job Id: 4270.s034-lcam
$ qstat -x -f 4270 | grep "Exit_status"
    Exit_status = 265

For a successful job, the Exit_status is reported as zero.

The -x option is needed because PBS might drop finished jobs from the qstat output before ERT is able to pick up the result (see #3880).
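As a rough illustration of the parsing step, here is a minimal Python sketch (the actual driver is C++, so this only shows the logic). The function names `query_job`, `parse_qstat_full` and `exit_status` are hypothetical, and the sample text contains only the two lines quoted above. The parser assumes one `key = value` attribute per line and ignores the wrapped continuation lines that real qstat -f output can contain.

```python
import subprocess
from typing import Dict, Optional


def query_job(job_id: str) -> str:
    """Run `qstat -x -f <jobid>` and return its stdout (needs a PBS node)."""
    result = subprocess.run(
        ["qstat", "-x", "-f", job_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_qstat_full(output: str) -> Dict[str, str]:
    """Collect the job id and the `key = value` attributes from the output."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Job Id:"):
            fields["Job Id"] = stripped.split(":", 1)[1].strip()
        elif " = " in stripped:
            key, value = stripped.split(" = ", 1)
            fields[key.strip()] = value.strip()
    return fields


def exit_status(output: str) -> Optional[int]:
    """Return Exit_status as an int, or None if the attribute is absent."""
    value = parse_qstat_full(output).get("Exit_status")
    return int(value) if value is not None else None


# Only the two lines shown in the comment above; real output has many more.
SAMPLE = "Job Id: 4270.s034-lcam\n    Exit_status = 265\n"

if __name__ == "__main__":
    print(exit_status(SAMPLE))
```

A driver would call something like `exit_status(query_job("4270"))` once the job has left the queue and treat any non-zero value as a failure.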

berland avatar Oct 24 '22 12:10 berland

Alternatively, the driver can be changed to always call qstat with the -f option and read the job status from this line:

$ qstat -x -f 4233 | grep job_state
    job_state = F
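Combining job_state with Exit_status could then look roughly like the Python sketch below (again only illustrating the logic, not the C++ driver). The function name `status_from_qstat` is hypothetical, and the status values are plain strings standing in for the driver's JOB_QUEUE_DONE and JOB_QUEUE_EXIT. Treating a finished job with a missing Exit_status as a failure is an assumption made here, not something confirmed in this thread.

```python
import re
from typing import Optional

# Stand-ins for the driver's status constants named in the issue.
JOB_QUEUE_DONE = "JOB_QUEUE_DONE"
JOB_QUEUE_EXIT = "JOB_QUEUE_EXIT"

# Matches the two attribute lines of interest in `qstat -x -f` output.
_ATTR = re.compile(r"^\s*(job_state|Exit_status)\s*=\s*(\S+)\s*$", re.MULTILINE)


def status_from_qstat(output: str) -> Optional[str]:
    """Map qstat -f output to a final status, or None if not finished."""
    attrs = dict(_ATTR.findall(output))
    if attrs.get("job_state") != "F":
        # Job is not reported as finished; the caller keeps polling.
        return None
    # Assumption: anything but an explicit zero counts as a failed job.
    if attrs.get("Exit_status") == "0":
        return JOB_QUEUE_DONE
    return JOB_QUEUE_EXIT


if __name__ == "__main__":
    print(status_from_qstat("    job_state = F\n    Exit_status = 265\n"))
```

With this split, a single qstat -x -f call per job yields both whether the job has finished and whether it failed.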

berland avatar Oct 24 '22 12:10 berland