toil
Flaky test: Kubernetes CWL conformance tests can still fail due to pods getting stuck in ContainerCreating
In https://ucsc-ci.com/databiosphere/toil/-/jobs/25890 the CI failed after a PR was accepted. It looks like one of the CWL Kubernetes conformance tests failed because a pod kept getting stuck in the ContainerCreating state every time we launched it. It was probably being scheduled onto the same flaky node repeatedly.
We should either add some kind of anti-affinity to try to send pods to nodes other than the ones they just failed on, or run the Kubernetes CWL conformance tests only on launched clusters and not on the UCSC Kubernetes cluster, which we don't seem to be able to keep in a reliably working state. Maybe we can just cut the x86 tests altogether, since I think we have an ARM test for this?
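For the anti-affinity option, a rough sketch of what the pod spec fragment could look like is below. This is not Toil's actual code: the helper name `avoid_nodes_affinity` is hypothetical, and it assumes we track the names of nodes where the pod previously got stuck and that the `kubernetes.io/hostname` label matches those node names (usually true, but not guaranteed). It builds a plain dict in the shape of the Kubernetes `affinity` field.

```python
from typing import Any, Dict, Iterable, Optional


def avoid_nodes_affinity(failed_nodes: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Build a pod 'affinity' fragment that keeps a pod off the given nodes.

    Hypothetical helper for illustration. Returns None when there is
    nothing to avoid, so the caller can leave the affinity field unset.
    """
    # Deduplicate and sort so the generated spec is stable between retries.
    values = sorted(set(failed_nodes))
    if not values:
        return None
    return {
        "nodeAffinity": {
            # Hard requirement: the scheduler will not place the pod on
            # any node whose hostname label is in the list.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/hostname",
                                "operator": "NotIn",
                                "values": values,
                            }
                        ]
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Example node names are made up.
    print(avoid_nodes_affinity(["node-a", "node-b", "node-a"]))
```

Because this is a hard requirement, a pod could become unschedulable if every eligible node ends up on the list. Using `preferredDuringSchedulingIgnoredDuringExecution` with a weight instead would only bias the scheduler away from those nodes.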
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1238 ┆Link To Issue: https://ucsc-cgl.atlassian.net/browse/TOIL-1238
➤ Lon Blauvelt commented:
Hopefully our new Phoenix infra fixes this by being more reliable?