
Rerun LSF tests on ssh failures (error code 255)

Open berland opened this issue 1 year ago • 4 comments

It is known that the commands for interacting with the LSF cluster go through a shell wrapper that makes an ssh call to some LSF server. If that server is too busy to respond to the ssh login, the command returns with error code 255.

This error code can be detected in the integration tests, and the affected tests can then be retried a limited number of times.

https://github.com/equinor/ert/blob/2d21583bae5f52a367c3ea492b2b76bbf07608cc/tests/integration_tests/scheduler/test_lsf_driver.py#L187-L191

The suggestion is to raise a specific exception for this kind of error, and then use pytest-rerunfailures to wait a few seconds and retry, up to a fixed number of attempts:

https://pypi.org/project/pytest-rerunfailures/#re-run-individual-failures
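A minimal sketch of the detection part, to make the idea concrete. The names `SshLoginError` and `run_lsf_command` are hypothetical and not taken from the ert code base; the only thing assumed from the description above is that the wrapper exits with code 255 when the ssh login fails.

```python
import subprocess
import sys
from typing import Sequence

# Exit code the ssh-based wrapper is described as returning on login failure.
SSH_FAILURE_RETURNCODE = 255


class SshLoginError(RuntimeError):
    """Hypothetical exception: the wrapped command exited with code 255."""


def run_lsf_command(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command and raise SshLoginError if it exits with code 255.

    Any other return code is passed back to the caller unchanged.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode == SSH_FAILURE_RETURNCODE:
        raise SshLoginError(
            f"{cmd[0]} exited with {SSH_FAILURE_RETURNCODE}: {result.stderr.strip()}"
        )
    return result
```

With pytest-rerunfailures installed, a test that calls such a helper could then be decorated with `@pytest.mark.flaky(reruns=3, reruns_delay=5, only_rerun=["SshLoginError"])`, so that only failures matching that exception name are rerun, with a delay between attempts. The rerun count and delay here are placeholder values.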

berland avatar Sep 06 '24 04:09 berland

See https://github.com/equinor/ert/pull/8790, it is no (longer?) true that only SSH errors give 255.

berland avatar Oct 10 '24 08:10 berland

> See #8790, it is no (longer?) true that only SSH errors give 255.

Hmm, you are right! Do we know the error message for flaky ssh? Maybe it is something like connection refused or similar.

jonathan-eq avatar Oct 10 '24 08:10 jonathan-eq

@berland For the sake of the tests, maybe we should rerun on error code 255 anyways. If it is the cluster acting up and not due to ssh, it wouldn't hurt to rerun the failing test.

jonathan-eq avatar Oct 18 '24 10:10 jonathan-eq

Cluster failures in LSF do not currently seem to be a problem in our tests, so maybe hold off on that until it is needed.

berland avatar Oct 18 '24 10:10 berland

Should we close this one now @berland ?

eivindjahren avatar Dec 11 '24 08:12 eivindjahren