
Jobs flake on fetching dependencies during image builds

mimowo opened this issue on Oct 23, 2024

What happened:

Both periodic-kueue-test-multikueue-e2e-main and pull-kueue-test-multikueue-e2e-main flake; these two jobs are affected most often.

Example from pull-kueue-test-multikueue-e2e-main: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3256/pull-kueue-test-multikueue-e2e-main/1848686153683701760

Examples from periodic-kueue-test-multikueue-e2e-main:

  • https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-kueue-test-multikueue-e2e-main/1846795226719457280
  • https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3256/pull-kueue-test-multikueue-e2e-main/1848686153683701760
  • https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3284/pull-kueue-test-multikueue-e2e-main/1848739383822258176

However, an analogous flake was also observed for pull-kueue-test-unit-main.

What you expected to happen:

No flakes

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Example log:

go: sigs.k8s.io/kind@v0.24.0: Get "https://proxy.golang.org/sigs.k8s.io/kind/@v/v0.24.0.info": net/http: TLS handshake timeout
go: sigs.k8s.io/kind@: version must not be empty

or

go: sigs.k8s.io/kustomize/kustomize/[email protected]: sigs.k8s.io/kustomize/kustomize/[email protected]: verifying module: sigs.k8s.io/kustomize/kustomize/[email protected]: Get "https://sum.golang.org/lookup/sigs.k8s.io/kustomize/kustomize/[email protected]": net/http: TLS handshake timeout
make: *** [Makefile-deps.mk:55: kustomize] Error 1

Example from pull-kueue-test-unit-main:

go: downloading gotest.tools/gotestsum v1.12.0
go: gotest.tools/gotestsum@v1.12.0: gotest.tools/gotestsum@v1.12.0: verifying module: gotest.tools/gotestsum@v1.12.0: Get "https://sum.golang.org/lookup/gotest.tools/gotestsum@v1.12.0": dial tcp 142.250.190.49:443: i/o timeout
make: *** [Makefile-deps.mk:70: gotestsum] Error 1
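
One mitigation idea (a rough, untested sketch on my side - the retry count and sleep are arbitrary, and this is not what Makefile-deps.mk currently does) would be to wrap the flaky go install calls in a small retry loop, so a single transient timeout does not fail the whole job. Using the kind download from the log above as an example:

# Hypothetical retry wrapper around a dependency download; the module and
# version come from the failing log above, everything else is made up.
ok=0
for attempt in 1 2 3; do
  if go install sigs.k8s.io/kind@v0.24.0; then
    ok=1
    break
  fi
  echo "go install failed (attempt ${attempt}), retrying in 5s..." >&2
  sleep 5
done
# Fail the make target only after all attempts have failed.
[ "${ok}" -eq 1 ]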

mimowo avatar Oct 23 '24 07:10 mimowo

/kind flake
/cc @trasc @mbobrovskyi @alculquicondor

I know this is not in the test code and it looks like an infra problem, but I still opened the issue because it seemingly only happens for the MultiKueue e2e jobs. Maybe the job is bigger and resource-constrained? In that case we could bump the resources in test-infra. Or maybe it is just bad "luck" - I don't know. Opening this to hear some ideas.

mimowo avatar Oct 23 '24 07:10 mimowo

Actually, I spotted a similar failure today for the unit tests: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/3292/pull-kueue-test-unit-main/1849084700195295232

go: downloading gotest.tools/gotestsum v1.12.0
go: gotest.tools/gotestsum@v1.12.0: gotest.tools/gotestsum@v1.12.0: verifying module: gotest.tools/gotestsum@v1.12.0: Get "https://sum.golang.org/lookup/gotest.tools/gotestsum@v1.12.0": dial tcp 142.250.190.49:443: i/o timeout
make: *** [Makefile-deps.mk:70: gotestsum] Error 1

So it probably affects all jobs :( Not sure if there is anything we can do about it. cc @tenzen-y
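
One thing that might help with the proxy.golang.org timeouts (untested, and only if I read the GOPROXY docs correctly): with '|' as the separator the go command falls back to the next source on any error, including network timeouts, while the default ',' only falls back on HTTP 404/410. It would not help with the sum.golang.org lookups, though. Roughly:

# Sketch only: fall back to a direct fetch when the proxy times out.
# The module and version are taken from the failing unit-test log above.
export GOPROXY='https://proxy.golang.org|direct'
go install gotest.tools/gotestsum@v1.12.0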

mimowo avatar Oct 23 '24 13:10 mimowo

/kind flake
/cc @trasc @mbobrovskyi @alculquicondor

I know this is not in the test code and it looks like an infra problem, but I still opened the issue because it seemingly only happens for the MultiKueue e2e jobs. Maybe the job is bigger and resource-constrained? In that case we could bump the resources in test-infra. Or maybe it is just bad "luck" - I don't know. Opening this to hear some ideas.

It seems that the MultiKueue e2e jobs sometimes overload their allotted resources: https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&from=now-7d&to=now&var-org=kubernetes-sigs&var-repo=kueue&var-job=pull-kueue-test-multikueue-e2e-main&var-build=All&refresh=30s

tenzen-y avatar Oct 23 '24 22:10 tenzen-y

I've updated the title to reflect that this is not just MultiKueue, even though the flake seems most common there.

EDIT: also started a thread on k8s-infra: https://kubernetes.slack.com/archives/CCK68P2Q2/p1729753129367449

mimowo avatar Oct 24 '24 06:10 mimowo

I've updated the title to reflect that this is not just MultiKueue, even though the flake seems most common there.

Thanks.

tenzen-y avatar Oct 24 '24 07:10 tenzen-y

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 22 '25 07:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 21 '25 08:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 23 '25 08:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Mar 23 '25 08:03 k8s-ci-robot