toil Flaky test: Kubernetes CWL conformance tests can still fail due to pods getting stuck in ContainerCreating

Open adamnovak opened this issue 3 years ago • 1 comment

In https://ucsc-ci.com/databiosphere/toil/-/jobs/25890 the CI failed after a PR was accepted. It looks like one of the CWL Kubernetes conformance tests failed because a pod got stuck in the ContainerCreating state every time we launched it. It was probably being scheduled onto the same flaky node repeatedly.

We should either add some kind of anti-affinity to try to send pods to nodes other than the ones they just failed on, or make the Kubernetes CWL conformance tests run only on launched clusters and not on the UCSC Kubernetes cluster, which we don't seem to be able to keep in a sufficiently reliable working state. Maybe we can just cut the x86 tests altogether, since I think we have an ARM test for this?
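For the anti-affinity option, here is a rough sketch of what the pod spec fragment could look like. This is not Toil's actual code: the function name `avoid_nodes_affinity` and the idea of tracking a list of node names where the pod previously got stuck are assumptions for illustration. It also assumes each node's `kubernetes.io/hostname` label matches the node name we recorded, which is usually but not always true. It builds a plain dict in the shape of the Kubernetes `affinity` field, using a `NotIn` match expression so the scheduler steers away from those nodes.

```python
from typing import Any, Dict, Iterable, Optional


def avoid_nodes_affinity(
    failed_nodes: Iterable[str], required: bool = False
) -> Optional[Dict[str, Any]]:
    """Build a pod 'affinity' fragment that avoids the given nodes.

    Hypothetical helper: failed_nodes is assumed to hold the hostnames of
    nodes where the pod previously got stuck in ContainerCreating.
    Returns None when there is nothing to avoid.
    """
    values = sorted(set(failed_nodes))
    if not values:
        return None

    # Match any node whose hostname label is NOT one of the failed nodes.
    term = {
        "matchExpressions": [
            {
                "key": "kubernetes.io/hostname",
                "operator": "NotIn",
                "values": values,
            }
        ]
    }

    if required:
        # Hard rule: the pod stays Pending if only failed nodes are available.
        node_affinity = {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [term]
            }
        }
    else:
        # Soft rule: prefer other nodes, but fall back to a failed one if needed.
        node_affinity = {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "preference": term}
            ]
        }

    return {"nodeAffinity": node_affinity}
```

The result would go under `spec.affinity` of the retried pod. The soft (preferred) form is the default here because a hard rule could leave the pod unschedulable on a small cluster where every node has failed at least once.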

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1238 ┆Link To Issue: https://ucsc-cgl.atlassian.net/browse/TOIL-1238

Nov 04 '22 15:11 adamnovak

➤ Lon Blauvelt commented:

Hopefully our new Phoenix infra fixes this by being more reliable?

Feb 22 '24 18:02 unito-bot
