alpha/beta periodic jobs: re-enable running slow tests
What would you like to be added:
Revert https://github.com/kubernetes/test-infra/pull/34607, i.e. re-apply https://github.com/kubernetes/test-infra/pull/34584.
We had to revert because it made the job flaky. There are two issues which may have to be solved first:
- The "Pods should cap back-off at MaxContainerBackOff" test runs for ~27 minutes, by design. If it doesn't get started early enough, it runs into the 1 hour time limit for these jobs (example).
Something, perhaps the same test, caused stability issues in other tests which were stable before. One symptom was
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to freeze
another was an unexpected exit code of 2.
Why is this needed:
Not running the slow tests meant that a regression was missed.
/cc @aojea @BenTheElder
/sig testing /triage accepted /milestone v1.34
@BenTheElder: The provided milestone is not valid for this repository. Milestones in this repository: [1.33, someday]
Use /milestone clear to clear the milestone.
In response to this:
/sig testing /triage accepted /milestone v1.34
The "Pods should cap back-off at MaxContainerBackOff" test runs for ~27 minutes, by design. If it doesn't get started early enough, it runs into the 1 hour time limit for these jobs (example).
We might need another tier like [Slow][VerySlow] or something 🤔
I would like to work on it, if you permit.
@BenTheElder Should I add separate timeouts for the slow and very slow scenarios, and then add a VerySlow label to the main kubernetes repo?
/assign
cc @aojea
I think:
The "Pods should cap back-off at MaxContainerBackOff" test runs for ~27 minutes, by design. If it doesn't get started early enough, it runs into the 1 hour time limit for these jobs (example).
This test could almost be categorized as Disruptive.
I'm hesitant to add a label like VerySlow that most of our CI jobs will not be configured to skip (we need to not only consider these specific jobs) and that may encourage writing more of these ridiculously slow tests.
I was discussing with @onsi that it would be useful to prioritize slow tests so that they get started early and then overlap with most of the other tests, without running into the job timeout. It's not available yet in Ginkgo.
I was discussing with @onsi that it would be useful to prioritize slow tests so that they get started early and then overlap with most of the other tests, without running into the job timeout. It's not available yet in Ginkgo.
I do not know if that will be a good solution, as this seems like a kind of knapsack problem. Slow tests usually consume more resources, so it would practically serialize the first interval, whereas running a mix of Slow and Fast tests seems most optimal at first sight.
Also, I think we do not need to generalize much, since we seem to have only one problematic test:
The "Pods should cap back-off at MaxContainerBackOff" test runs for ~27 minutes, by design.
That test is indeed Disruptive to me; a test that takes 27 minutes has to make a lot of assumptions, and tagging it as Disruptive seems to solve all the existing problems.
usually Slow test consume more resources
I'm not sure about that. In my experience, they are slow because they have to wait for timeouts. That doesn't consume many resources because there is no active processing.
So unless a slow test is also marked as serial or disruptive, it can run fine in parallel to other tests.
Also, I think we do not need to generalize much, since we seem to have only one problematic test.
This came up before for some other slow test.
I'm not sure about that. In my experience, they are slow because they have to wait for timeouts. That doesn't consume many resources because there is no active processing.
I take it back; I was generalizing from one specific test I had in mind. I took a look at https://testgrid.k8s.io/sig-testing-kind#conformance,%20master%20(dev)&graph-metrics=test-duration-minutes and your understanding is more correct than mine.
This came up before for some other slow test.
What is the criterion then for "veryslow"?
We don't have "veryslow" right now. If we had Ginkgo priorities, then perhaps adding it would make sense:
- slow: > 5 minutes, priority 1 instead of 0 for non-slow tests
- veryslow: > 20 minutes, priority 2
Then jobs which need to finish quickly (= in less than 20 minutes) could filter out "veryslow". All other jobs would start it first, so it should have a chance to complete.
The 20 minutes threshold is open for debate...
A unit test with a fake clock may be more reasonable for these behaviors than a ~30 min e2e test.
I'm not sure we should be optimizing to enable this.