Flaky scheduler integration tests running against real LSF

Open jonathan-eq opened this issue 10 months ago • 2 comments

Describe the bug

Some of the scheduler integration tests are timing out and failing when running against the real LSF cluster.

FAILED integration_tests/scheduler/test_generic_driver.py::test_submit[LsfDriver] - Failed: Timeout >360.0s
FAILED integration_tests/scheduler/test_lsf_driver.py::test_submit_to_named_queue - Failed: Timeout >360.0s

(https://github.com/equinor/komodo-releases/actions/runs/8596028934/job/23554954383)

Expected behaviour

Tests not failing.

Environment

  • OS: RHEL8
  • ERT/Komodo release: ert-onprem-2024.04.rc0-py3.11-rhel8
  • Python version: 3.11
  • Remote/HPC execution involved: Yes, onprem runner.

jonathan-eq • Apr 08 '24 12:04

The test_submit_to_named_queue failure was probably due to issues on the RHEL8 cluster: the short queue was down, so all jobs were stuck in pending.

larsevj • Apr 08 '24 15:04

Unsure about the other one, but given the very limited number of compute nodes available on RHEL8, the test might time out before the job is actually dispatched to a compute node.

larsevj • Apr 08 '24 15:04
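
To illustrate the point above: a test that waits for a job to start uses up its whole time budget while the job sits in the PEND state. The following is a minimal sketch under stated assumptions, not ERT's driver code. `wait_until_running`, `make_job`, and `FakeClock` are hypothetical names introduced here; only the job states `PEND` and `RUN` come from LSF. A fake clock is used so the example finishes instantly.

```python
import time
from typing import Callable


def wait_until_running(
    poll_state: Callable[[], str],
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once poll_state() reports "RUN", False if `timeout` seconds pass first."""
    deadline = clock() + timeout
    while True:
        if poll_state() == "RUN":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


class FakeClock:
    """Hypothetical stand-in for time.monotonic/time.sleep so nothing really waits."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def make_job(pending_polls: int) -> Callable[[], str]:
    """Hypothetical job that reports PEND for `pending_polls` polls, then RUN."""
    remaining = pending_polls

    def poll() -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return "PEND"
        return "RUN"

    return poll


if __name__ == "__main__":
    # A job that stays pending for 100 polls at 10 s each, i.e. 1000 simulated seconds.
    for budget in (360.0, 3600.0):
        fake = FakeClock()
        started = wait_until_running(
            make_job(pending_polls=100),
            timeout=budget,
            interval=10.0,
            clock=fake.time,
            sleep=fake.sleep,
        )
        print(f"timeout={budget:.0f}s started={started} waited={fake.now:.0f}s")
```

With a 360 s budget the simulated job is still pending when the deadline passes, so the wait gives up; with a 3600 s budget the same job starts after 1000 simulated seconds. Nothing in the code under test differs between the two runs, only how long the queue keeps the job pending.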

Shall we then skip this test against the real RHEL8 cluster and fall back to the mocked driver? @larsevj @berland

xjules • Apr 10 '24 07:04

I am not sure it is a problem to keep running it, as long as we are aware that timeouts on RHEL8 are to be expected. The tests seem to run fine on RHEL7.

larsevj • Apr 10 '24 07:04

I suggest increasing the timeout to one hour.

berland • Apr 10 '24 08:04
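
One way the one-hour suggestion could be applied only to runs against the real cluster is sketched below. This is an assumption-laden illustration, not how the ERT test suite is configured: the environment variable `RUN_AGAINST_REAL_LSF` and the function `scheduler_test_timeout` are hypothetical names, and the 360 s default is simply the limit reported in the failure messages.

```python
import os
from typing import Mapping

# The limit reported in the failing runs ("Timeout >360.0s").
DEFAULT_TIMEOUT_SECONDS = 360
# The one hour suggested in the thread.
REAL_CLUSTER_TIMEOUT_SECONDS = 3600


def scheduler_test_timeout(env: Mapping[str, str] = os.environ) -> int:
    """Pick a test timeout in seconds depending on where the tests run.

    RUN_AGAINST_REAL_LSF is a hypothetical variable name used only for
    this sketch; a real suite may signal this differently.
    """
    if env.get("RUN_AGAINST_REAL_LSF", "") == "1":
        return REAL_CLUSTER_TIMEOUT_SECONDS
    return DEFAULT_TIMEOUT_SECONDS


if __name__ == "__main__":
    print(scheduler_test_timeout({}))
    print(scheduler_test_timeout({"RUN_AGAINST_REAL_LSF": "1"}))
```

The "Failed: Timeout >360.0s" message looks like output from the pytest-timeout plugin. If that is what the suite uses, the chosen value could be applied per test with `@pytest.mark.timeout(...)`, or globally through the `timeout` ini option or the `--timeout` command-line flag.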