Torque driver does not handle non-zero exit code from jobs
Describe the bug
The torque driver does not propagate information about failing jobs in the same way as the other drivers. For example, the LSF driver will return JOB_QUEUE_EXIT if the LSF job exited with a non-zero exit code. The torque driver can produce only the following statuses, none of which reflect the job's exit code (see also http://docs.adaptivecomputing.com/torque/4-1-3/Content/topics/commands/qstat.htm#standardOutput):
https://github.com/equinor/ert/blob/94a29ddf18ec6f72e94d5cbc6a097c57db67a411/src/libres/lib/job_queue/torque_driver.cpp#L609-L631
This will lead to strange behavior, where ERT will attempt to load data, even though the job failed. The realization will be marked as successful, despite having failed jobs.
To reproduce
Can be reproduced by adding a failing job after poly_eval in the poly example and running it with the torque driver.
Data will be loaded and the realization appears successful, even though the last job failed.
Expected behaviour
Handle failing jobs the same way as e.g. the LSF driver does: a job exiting with a non-zero exit code should make the driver propagate JOB_QUEUE_EXIT, not JOB_QUEUE_DONE.
Environment
- OS: RHEL7
- ERT/Komodo release: bleeding (~2.38) on azure
- Python version: 3.8
- Remote/HPC execution involved: yes
Additional context N/A
See also: https://github.com/equinor/ert/issues/3781
This involves modifying the driver to query the queueing system for each individual job with qstat -f <jobid> and then parse the output. From man qstat on an Azure node:
Job Status in Long Format
Trigger: the -f option.
If you specify the -f (full) option, full job status information for each job is displayed in this order:
The job ID
Each job attribute, one to a line
The job's submission arguments
The job's executable, in JSDL format
The executable's argument list, in JSDL format
Example for a failed job (a python job doing sys.exit(1)):
$ qstat -x -f 4270 | head -n 1
Job Id: 4270.s034-lcam
$ qstat -x -f 4270 | grep "Exit_status"
Exit_status = 265
For a successful job, the Exit_status is reported as zero.
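As a rough illustration of the parsing step, the sketch below turns `qstat -f` text into a dictionary and reads `Exit_status` from it. The helper name `parse_qstat_full` is hypothetical and not part of ERT. It assumes every attribute sits on a single `key = value` line, so values that qstat wraps across several lines would need extra handling. The sample text uses only the two lines shown above.

```python
def parse_qstat_full(output: str) -> dict:
    """Parse `qstat -f` text into a dict of attributes (hypothetical helper).

    Assumption: each attribute is on one "key = value" line. Lines that do not
    match (blank lines, wrapped continuation lines) are ignored.
    """
    attributes = {}
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Job Id:"):
            # Header line, e.g. "Job Id: 4270.s034-lcam"
            attributes["Job Id"] = stripped.split(":", 1)[1].strip()
        elif " = " in stripped:
            key, value = stripped.split(" = ", 1)
            attributes[key.strip()] = value.strip()
    return attributes


# Sample built from the qstat lines quoted above
sample = "Job Id: 4270.s034-lcam\n    Exit_status = 265\n"
attrs = parse_qstat_full(sample)
exit_status = int(attrs["Exit_status"])
print(attrs["Job Id"], exit_status)
```

A real driver would also need to handle the case where `Exit_status` is absent, for example while the job is still running.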
The -x option is needed because PBS might remove finished jobs from the qstat output before ERT is able to pick up the result (see #3880).
The driver can also be changed to always call qstat with the -f option, and pick the job status from this line:
$ qstat -x -f 4233 | grep job_state
job_state = F
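Combining the two fields, the decision the driver would have to make could look roughly like the sketch below. The function name `queue_status` is hypothetical. It returns the status names as strings only for illustration, and it assumes that `F` is the only finished state, as in the output above. Treating a finished job with no exit status as failed is also an assumption, chosen to be conservative.

```python
from typing import Optional


def queue_status(job_state: str, exit_status: Optional[int]) -> Optional[str]:
    """Map job_state and Exit_status to a queue status name (hypothetical).

    Returns None when the job has not finished, leaving the existing
    state mapping in the driver to decide.
    """
    if job_state != "F":
        # Assumption: only "F" means finished; other states are not handled here
        return None
    if exit_status == 0:
        return "JOB_QUEUE_DONE"
    # Non-zero, or missing, exit status is treated as a failed job
    return "JOB_QUEUE_EXIT"


print(queue_status("F", 0))
print(queue_status("F", 265))
```

With this mapping, the failed job from the example (Exit_status = 265) would be reported as JOB_QUEUE_EXIT instead of JOB_QUEUE_DONE.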