alpha/beta periodic jobs: re-enable running slow tests
What would you like to be added:
Revert https://github.com/kubernetes/test-infra/pull/34607, i.e. re-apply https://github.com/kubernetes/test-infra/pull/34584.
We had to revert because it made the job flaky. There are two issues which may have to be solved first:
- The "Pods should cap back-off at MaxContainerBackOff" test runs for ~27 minutes, by design. If it doesn't get started early enough, it runs into the 1 hour time limit for these jobs (example).
Something, perhaps the same test, caused stability issues in other tests which were stable before. One symptom was
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to freeze
another was an unexpected exit code of 2.
Why is this needed:
Not running the slow tests meant that a regression was missed.
/cc @aojea @BenTheElder
/sig testing /triage accepted /milestone v1.34
@BenTheElder: The provided milestone is not valid for this repository. Milestones in this repository: [1.33, someday]
Use /milestone clear to clear the milestone.
In response to this:
/sig testing /triage accepted /milestone v1.34
The "Pods should cap back-off at MaxContainerBackOff" test runs for ~27 minutes, by design. If it doesn't get started early enough, it runs into the 1 hour time limit for these jobs (example).
We might need another tier like [Slow][VerySlow] or something 🤔
I would like to work on it, if you permit.
@BenTheElder Should I add separate timeouts for the slow and very slow scenarios, and then add a VerySlow label to the main kubernetes repo?
/assign
cc @aojea
I think:
The "Pods should cap back-off at MaxContainerBackOff" test runs for ~27 minutes, by design. If it doesn't get started early enough, it runs into the 1 hour time limit for these jobs (example).
This test could almost be categorized as Disruptive.
I'm hesitant to add a label like VerySlow that most of our CI jobs will not be configured to skip (we need to not only consider these specific jobs) and that may encourage writing more of these ridiculously slow tests.
I was discussing with @onsi that it would be useful to prioritize slow tests so that they get started early and then overlap with most of the other tests, without running into the job timeout. It's not available yet in Ginkgo.
I was discussing with @onsi that it would be useful to prioritize slow tests so that they get started early and then overlap with most of the other tests, without running into the job timeout. It's not available yet in Ginkgo.
I do not know if that will be a good solution, as this seems like a kind of knapsack problem. Slow tests usually consume more resources, so it would practically serialize the first interval, whereas running a mix of Slow and Fast tests seems most optimal at first sight.
Also, I think we do not need to generalize much, since we seem to have only one problematic test:
The "Pods should cap back-off at MaxContainerBackOff" test runs for ~27 minutes, by design.
That test is indeed Disruptive to me; a test that takes 27 minutes has to make a lot of assumptions, and tagging it as Disruptive seems to solve all the existing problems.
usually Slow test consume more resources
I'm not sure about that. In my experience, they are slow because they have to wait for timeouts. That doesn't consume many resources because there is no active processing.
So unless a slow test is also marked as serial or disruptive, it can run fine in parallel to other tests.
Also, I think we do not need to generalize much, since we seem to have only one problematic test.
This came up before for some other slow test.
I'm not sure about that. In my experience, they are slow because they have to wait for timeouts. That doesn't consume many resources because there is no active processing.
I take it back; I was generalizing from one specific test I had in mind. I took a look at https://testgrid.k8s.io/sig-testing-kind#conformance,%20master%20(dev)&graph-metrics=test-duration-minutes and your understanding is more correct than mine.
This came up before for some other slow test.
What is the criterion then for "veryslow"?
We don't have "veryslow" right now. If we had Ginkgo priorities, then perhaps adding it would make sense:
- slow: > 5 minutes, priority 1 instead of 0 for non-slow tests
- veryslow: > 20 minutes, priority 2
Then jobs which need to finish quickly (= in less than 20 minutes) could filter out "veryslow". All other jobs would start it first, so it should have a chance to complete.
The 20 minutes threshold is open for debate...
A unit test with a fake clock may be more reasonable for these behaviors than a ~30 min e2e test.
I'm not sure we should be optimizing to enable this.