
Jobs flake on fetching dependencies during image builds

mimowo opened this issue on Oct 23, 2024

What happened:

Both periodic-kueue-test-multikueue-e2e-main and pull-kueue-test-multikueue-e2e-main flake; these two jobs are affected most often.

Example from pull-kueue-test-multikueue-e2e-main: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3256/pull-kueue-test-multikueue-e2e-main/1848686153683701760

Examples from periodic-kueue-test-multikueue-e2e-main:

  • https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-kueue-test-multikueue-e2e-main/1846795226719457280
  • https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3256/pull-kueue-test-multikueue-e2e-main/1848686153683701760
  • https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3284/pull-kueue-test-multikueue-e2e-main/1848739383822258176

However, an analogous flake was also observed for pull-kueue-test-unit-main.

What you expected to happen:

No flakes

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Example log:

go: sigs.k8s.io/kind@v0.24.0: Get "https://proxy.golang.org/sigs.k8s.io/kind/@v/v0.24.0.info": net/http: TLS handshake timeout
go: sigs.k8s.io/kind@: version must not be empty

or

go: sigs.k8s.io/kustomize/kustomize/[email protected]: sigs.k8s.io/kustomize/kustomize/[email protected]: verifying module: sigs.k8s.io/kustomize/kustomize/[email protected]: Get "https://sum.golang.org/lookup/sigs.k8s.io/kustomize/kustomize/[email protected]": net/http: TLS handshake timeout
make: *** [Makefile-deps.mk:55: kustomize] Error 1

Example from pull-kueue-test-unit-main:

go: downloading gotest.tools/gotestsum v1.12.0
go: gotest.tools/gotestsum@v1.12.0: gotest.tools/gotestsum@v1.12.0: verifying module: gotest.tools/gotestsum@v1.12.0: Get "https://sum.golang.org/lookup/gotest.tools/gotestsum@v1.12.0": dial tcp 142.250.190.49:443: i/o timeout
make: *** [Makefile-deps.mk:70: gotestsum] Error 1
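
One mitigation idea (a rough, untested sketch on my side - the retry count and sleep are arbitrary, and this is not what Makefile-deps.mk currently does) would be to wrap the flaky go install calls in a small retry loop, so a single transient timeout does not fail the whole job. Using the kind download from the log above as an example:

# Hypothetical retry wrapper around a dependency download; the module and
# version come from the failing log above, everything else is made up.
ok=0
for attempt in 1 2 3; do
  if go install sigs.k8s.io/kind@v0.24.0; then
    ok=1
    break
  fi
  echo "go install failed (attempt ${attempt}), retrying in 5s..." >&2
  sleep 5
done
# Fail the make target only after all attempts have failed.
[ "${ok}" -eq 1 ]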

mimowo avatar Oct 23 '24 07:10 mimowo

/kind flake
/cc @trasc @mbobrovskyi @alculquicondor

I know this is not in the test code and it looks like an infra problem, but I still opened the issue because it seemingly only happens for the MultiKueue e2e jobs. Maybe the job is bigger and resource-constrained? In that case we could bump the resources in test-infra. Or maybe it is just bad "luck" - I don't know. Opening this to hear some ideas.

mimowo avatar Oct 23 '24 07:10 mimowo

Actually, I spotted a similar failure today for the unit tests: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3292/pull-kueue-test-unit-main/1849084700195295232

go: downloading gotest.tools/gotestsum v1.12.0
go: gotest.tools/gotestsum@v1.12.0: gotest.tools/gotestsum@v1.12.0: verifying module: gotest.tools/gotestsum@v1.12.0: Get "https://sum.golang.org/lookup/gotest.tools/gotestsum@v1.12.0": dial tcp 142.250.190.49:443: i/o timeout
make: *** [Makefile-deps.mk:70: gotestsum] Error 1

So it probably affects all jobs :( Not sure if there is anything we can do about it. cc @tenzen-y
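
One thing that might help with the proxy.golang.org timeouts (untested, and only if I read the GOPROXY docs correctly): with '|' as the separator the go command falls back to the next source on any error, including network timeouts, while the default ',' only falls back on HTTP 404/410. It would not help with the sum.golang.org lookups, though. Roughly:

# Sketch only: fall back to a direct fetch when the proxy times out.
# The module and version are taken from the failing unit-test log above.
export GOPROXY='https://proxy.golang.org|direct'
go install gotest.tools/gotestsum@v1.12.0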

mimowo avatar Oct 23 '24 13:10 mimowo

/kind flake
/cc @trasc @mbobrovskyi @alculquicondor

I know this is not in the test code and it looks like an infra problem, but I still opened the issue because it seemingly only happens for the MultiKueue e2e jobs. Maybe the job is bigger and resource-constrained? In that case we could bump the resources in test-infra. Or maybe it is just bad "luck" - I don't know. Opening this to hear some ideas.

It seems that the MultiKueue e2e jobs sometimes overload their allotted resources: https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&from=now-7d&to=now&var-org=kubernetes-sigs&var-repo=kueue&var-job=pull-kueue-test-multikueue-e2e-main&var-build=All&refresh=30s

tenzen-y avatar Oct 23 '24 22:10 tenzen-y

I've updated the title to reflect that this is not just MultiKueue, even though the flake seems most common there.

EDIT: also started a thread on k8s-infra: https://kubernetes.slack.com/archives/CCK68P2Q2/p1729753129367449

mimowo avatar Oct 24 '24 06:10 mimowo

I've updated the title to reflect that this is not just MultiKueue, even though the flake seems most common there.

Thanks.

tenzen-y avatar Oct 24 '24 07:10 tenzen-y

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 22 '25 07:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 21 '25 08:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 23 '25 08:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Mar 23 '25 08:03 k8s-ci-robot