
fix: use agnhost images preloaded to the containerd in tests

mykysha opened this pull request 5 months ago • 20 comments

What type of PR is this?

/kind failing-test /kind flake

What this PR does / why we need it:

Remove digest from the agnhost images in e2e tests so that they are not re-pulled.

Which issue(s) this PR fixes:

Fixes #5639 Fixes #5640

Special notes for your reviewer:

Agnhost image is already being preloaded in the hack/e2e-common.sh:

# agnhost image to use for testing.
export E2E_TEST_AGNHOST_IMAGE_OLD=registry.k8s.io/e2e-test-images/agnhost:2.52@sha256:b173c7d0ffe3d805d49f4dfe48375169b7b8d2e1feb81783efd61eb9d08042e6
E2E_TEST_AGNHOST_IMAGE_OLD_WITHOUT_SHA=${E2E_TEST_AGNHOST_IMAGE_OLD%%@*}
export E2E_TEST_AGNHOST_IMAGE=registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85
E2E_TEST_AGNHOST_IMAGE_WITHOUT_SHA=${E2E_TEST_AGNHOST_IMAGE%%@*}
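The digest stripping above relies on POSIX shell parameter expansion. As a minimal standalone illustration (not the actual script), `${var%%@*}` deletes everything from the `@` onward, leaving the tag-only reference:

```shell
# "${var%%@*}" removes the longest suffix matching "@*", i.e. the
# "@sha256:..." digest part, leaving only registry/name:tag.
IMAGE="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
IMAGE_WITHOUT_SHA="${IMAGE%%@*}"
echo "$IMAGE_WITHOUT_SHA"   # registry.k8s.io/e2e-test-images/agnhost:2.53
```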

However, because the preloaded image is loaded into containerd without the sha256 digest, while the tests strictly require an image reference with the digest, the images were pulled again, resulting in flaky behavior whenever network errors occurred.

With the sha256 digests removed from the test images, the tests can be run without an internet connection after the initial setup, which was not possible before.

Does this PR introduce a user-facing change?

NONE

mykysha avatar Jun 26 '25 11:06 mykysha

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
Latest commit 99374edaf5891e47835aa82705416e6c5664b02c
Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68650e1114c7ab0008f4a4ed
Deploy Preview https://deploy-preview-5780--kubernetes-sigs-kueue.netlify.app

netlify[bot] avatar Jun 26 '25 11:06 netlify[bot]

/cc @mbobrovskyi

mykysha avatar Jun 26 '25 11:06 mykysha

Awesome! Thank you so much!

/lgtm

mbobrovskyi avatar Jun 26 '25 11:06 mbobrovskyi

LGTM label has been added.

Git tree hash: 26019329a938a745480d6b04aa5fcbcdd11c7712

k8s-ci-robot avatar Jun 26 '25 11:06 k8s-ci-robot

/cc @mimowo

mbobrovskyi avatar Jun 26 '25 11:06 mbobrovskyi

Could you provide some explanation why this would actually fix the issues mentioned?

In particular, why the tests would fail due to re-pulling of the image.

Looking at the failed asserts it does not seem related directly to the image management. For example the tests show that 67/69 tests passed, so it is not clear to me how the other tests passed and this failed.

/hold I want to first understand what is the relationship between re-pulling the image and failures in the specific issues.

mimowo avatar Jun 26 '25 11:06 mimowo

In e2e-common.sh, we define agnhost variables using both the tag and the digest:

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L65-L69

Later, we create a tag and pull the image using only the tag (without the digest):

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L105

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L133-L134

However, in tests we refer to the image using both tag and digest: https://github.com/kubernetes-sigs/kueue/blob/c08e7b1285e824391175868da59b38177c1abeda/test/util/e2e.go#L57-L60

As a result, the image is pulled again during tests using the digest, even though we already pulled it earlier using just the tag. To avoid this redundant pull, we should consistently use the tag without the digest in tests.

mbobrovskyi avatar Jun 26 '25 12:06 mbobrovskyi

Could you provide some explanation why this would actually fix the issues mentioned?

In the kubelet logs we found an error:

Jun 12 20:35:17 kind-worker2 kubelet[231]: I0612 20:35:17.902789     231 status_manager.go:911] "Patch status for pod" pod="e2e-grrwf/job-set-replicated-job-1-0-1-clwkq" podUID="26a2c16a-e30c-4b10-bf2b-8d76e937fc4e" patch="{\"metadata\":{\"uid\":\"26a2c16a-e30c-4b10-bf2b-8d76e937fc4e\"},\"status\":{\"containerStatuses\":[{\"image\":\"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\",\"imageID\":\"\",\"lastState\":{},\"name\":\"c\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"message\":\"Back-off pulling image \\\"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": ErrImagePull: failed to pull and unpack image \\\"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error - Server message: unknown: Internal error encountered.\",\"reason\":\"ImagePullBackOff\"}},\"volumeMounts\":[{\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\",\"name\":\"kube-api-access-jl6wk\",\"readOnly\":true,\"recursiveReadOnly\":\"Disabled\"}]}]}}"
Jun 12 20:35:17 kind-worker2 kubelet[231]: I0612 20:35:17.902911     231 status_manager.go:920] "Status for pod updated successfully" pod="e2e-grrwf/job-set-replicated-job-1-0-1-clwkq" statusVersion=5 status={"phase":"Pending","conditions":[{"type":"PodReadyToStartContainers","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:36Z"},{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:34Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:34Z","reason":"ContainersNotReady","message":"containers with unready status: [c]"},{"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:34Z","reason":"ContainersNotReady","message":"containers with unready status: [c]"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:34Z"}],"hostIP":"172.18.0.2","hostIPs":[{"ip":"172.18.0.2"}],"podIP":"10.244.2.6","podIPs":[{"ip":"10.244.2.6"}],"startTime":"2025-06-12T20:34:34Z","containerStatuses":[{"name":"c","state":{"waiting":{"reason":"ImagePullBackOff","message":"Back-off pulling image \"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": ErrImagePull: failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error - Server message: unknown: Internal error 
encountered."}},"lastState":{},"ready":false,"restartCount":0,"image":"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85","imageID":"","started":false,"volumeMounts":[{"name":"kube-api-access-jl6wk","mountPath":"/var/run/secrets/kubernetes.io/serviceaccount","readOnly":true,"recursiveReadOnly":"Disabled"}]}],"qosClass":"Guaranteed"}
Jun 12 20:35:18 kind-worker2 kubelet[231]: I0612 20:35:18.053666     231 projected.go:185] Setting up volume kube-api-access-jl6wk for pod 26a2c16a-e30c-4b10-bf2b-8d76e937fc4e at /var/lib/kubelet/pods/26a2c16a-e30c-4b10-bf2b-8d76e937fc4e/volumes/kubernetes.io~projected/kube-api-access-jl6wk
Jun 12 20:35:18 kind-worker2 kubelet[231]: E0612 20:35:18.515321     231 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered." image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
Jun 12 20:35:18 kind-worker2 kubelet[231]: E0612 20:35:18.515389     231 kuberuntime_image.go:55] "Failed to pull image" err="failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered." image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
Jun 12 20:35:18 kind-worker2 kubelet[231]: I0612 20:35:18.515486     231 kuberuntime_image.go:51] "Pulling image without credentials" image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
Jun 12 20:35:18 kind-worker2 kubelet[231]: E0612 20:35:18.515644     231 kuberuntime_manager.go:1341] "Unhandled Error" err="container &Container{Name:c,Image:registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85,Command:[],Args:[entrypoint-tester],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:JOB_COMPLETION_INDEX,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.labels['batch.kubernetes.io/job-completion-index'],},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{500 -3} {<nil>} 500m DecimalSI},memory: {{200 6} {<nil>} 200M DecimalSI},},Requests:ResourceList{cpu: {{500 -3} {<nil>} 500m DecimalSI},memory: {{200 6} {<nil>} 200M DecimalSI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-dz62k,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod job-set-replicated-job-1-0-0-clmt9_e2e-grrwf(a57da650-219f-454c-8dc7-e64aae90ad2e): ErrImagePull: failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered." logger="UnhandledError"
Jun 12 20:35:18 kind-worker2 kubelet[231]: I0612 20:35:18.515716     231 event.go:389] "Event occurred" object="e2e-grrwf/job-set-replicated-job-1-0-0-clmt9" fieldPath="spec.containers{c}" kind="Pod" apiVersion="v1" type="Warning" reason="Failed" message="Failed to pull image \"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered."
Jun 12 20:35:18 kind-worker2 kubelet[231]: I0612 20:35:18.515743     231 event.go:389] "Event occurred" object="e2e-grrwf/job-set-replicated-job-1-0-0-clmt9" fieldPath="spec.containers{c}" kind="Pod" apiVersion="v1" type="Warning" reason="Failed" message="Error: ErrImagePull"
Jun 12 20:35:18 kind-worker2 kubelet[231]: E0612 20:35:18.517118     231 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"c\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered.\"" pod="e2e-grrwf/job-set-replicated-job-1-0-0-clmt9" podUID="a57da650-219f-454c-8dc7-e64aae90ad2e"
Jun 12 20:35:19 kind-worker2 kubelet[231]: E0612 20:35:19.001971     231 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to resolve reference \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": unexpected status from HEAD request to https://us-central1-docker.pkg.dev/v2/k8s-artifacts-prod/images/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error" image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"

That means we're trying to pull the image during the test because it's not available locally with both the tag and digest.

Looking at the failed asserts it does not seem related directly to the image management. For example the tests show that 67/69 tests passed, so it is not clear to me how the other tests passed and this failed.

I think a failure like this is possible due to a network issue, but it shouldn't happen in tests since we've already pulled the image in e2e-common.sh.

mbobrovskyi avatar Jun 26 '25 12:06 mbobrovskyi

We’re seeing the same issue in another test case (related to LeaderWorkerSet). In the kubelet logs, the same error appears:

Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.394295     231 kuberuntime_image.go:51] "Pulling image without credentials" image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.394387     231 event.go:389] "Event occurred" object="lws-e2e-9rbc8/lws-0" fieldPath="spec.containers{c}" kind="Pod" apiVersion="v1" type="Normal" reason="Pulling" message="Pulling image \"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\""
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.404255     231 status_manager.go:911] "Patch status for pod" pod="lws-e2e-9rbc8/lws-0" podUID="687af386-2fd4-477a-b817-f944efe5019b" patch="{\"metadata\":{\"uid\":\"687af386-2fd4-477a-b817-f944efe5019b\"},\"status\":{\"containerStatuses\":[{\"image\":\"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\",\"imageID\":\"\",\"lastState\":{},\"name\":\"c\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"message\":\"Back-off pulling image \\\"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": ErrImagePull: failed to pull and unpack image \\\"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": failed to resolve reference \\\"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": unexpected status from HEAD request to https://us-central1-docker.pkg.dev/v2/k8s-artifacts-prod/images/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error\",\"reason\":\"ImagePullBackOff\"}},\"volumeMounts\":[{\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\",\"name\":\"kube-api-access-5sxhp\",\"readOnly\":true,\"recursiveReadOnly\":\"Disabled\"}]}]}}"
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.404351     231 status_manager.go:920] "Status for pod updated successfully" pod="lws-e2e-9rbc8/lws-0" statusVersion=5 status={"phase":"Pending","conditions":[{"type":"PodReadyToStartContainers","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:55Z"},{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:49Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:49Z","reason":"ContainersNotReady","message":"containers with unready status: [c]"},{"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:49Z","reason":"ContainersNotReady","message":"containers with unready status: [c]"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:49Z"}],"hostIP":"172.18.0.3","hostIPs":[{"ip":"172.18.0.3"}],"podIP":"10.244.1.7","podIPs":[{"ip":"10.244.1.7"}],"startTime":"2025-06-12T20:25:49Z","containerStatuses":[{"name":"c","state":{"waiting":{"reason":"ImagePullBackOff","message":"Back-off pulling image \"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": ErrImagePull: failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to resolve reference \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": unexpected status from HEAD request to https://us-central1-docker.pkg.dev/v2/k8s-artifacts-prod/images/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server 
Error"}},"lastState":{},"ready":false,"restartCount":0,"image":"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85","imageID":"","started":false,"volumeMounts":[{"name":"kube-api-access-5sxhp","mountPath":"/var/run/secrets/kubernetes.io/serviceaccount","readOnly":true,"recursiveReadOnly":"Disabled"}]}],"qosClass":"Burstable"}
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470523     231 projected.go:185] Setting up volume kube-api-access-f2bxl for pod 062c5331-7da8-4cd8-82d7-805a87ab2f41 at /var/lib/kubelet/pods/062c5331-7da8-4cd8-82d7-805a87ab2f41/volumes/kubernetes.io~projected/kube-api-access-f2bxl
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470620     231 projected.go:185] Setting up volume kube-api-access-5sxhp for pod 687af386-2fd4-477a-b817-f944efe5019b at /var/lib/kubelet/pods/687af386-2fd4-477a-b817-f944efe5019b/volumes/kubernetes.io~projected/kube-api-access-5sxhp
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470622     231 configmap.go:181] Setting up volume kube-proxy for pod 062c5331-7da8-4cd8-82d7-805a87ab2f41 at /var/lib/kubelet/pods/062c5331-7da8-4cd8-82d7-805a87ab2f41/volumes/kubernetes.io~configmap/kube-proxy
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470683     231 configmap.go:205] Received configMap kube-system/kube-proxy containing (2) pieces of data, 1755 total bytes
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470746     231 quota_linux.go:276] SupportsQuotas called, but quotas disabled
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470758     231 empty_dir.go:305] assignQuota called, hasQuotas = false userNamespacesEnabled = false
Jun 12 20:26:33 kind-worker kubelet[231]: I0612 20:26:33.001374     231 prober.go:116] "Probe succeeded" probeType="Readiness" pod="default/kuberay-operator-58c5788645-72j6m" podUID="14d17dd4-f0c7-498d-8834-7fa917d02ef0" containerName="kuberay-operator"

But this occurs on another worker node that also doesn’t have the agnhost image. The only issue we can identify is that Kubernetes attempts to pull the image, and something goes wrong on the registry side during that process, resulting in an error. This doesn’t seem to be related to Kueue.

mbobrovskyi avatar Jun 26 '25 12:06 mbobrovskyi

Cool so I see the image is pulled during the test run, but why does it cause the test failure? is it because kueue is restarted just after the new image is pulled?

mimowo avatar Jun 26 '25 15:06 mimowo

Cool so I see the image is pulled during the test run, but why does it cause the test failure? is it because kueue is restarted just after the new image is pulled?

No, this is singlecluster and we are not restarting Kueue here. I think it's just a network issue.

mbobrovskyi avatar Jun 26 '25 15:06 mbobrovskyi

Do you know why the network issue manifests with the assert failures like these?

mimowo avatar Jun 26 '25 16:06 mimowo

In any case I'm ok to merge it, but I'm still wondering why a network issue while downloading the image results in the assert failures for kueue.

mimowo avatar Jun 26 '25 16:06 mimowo

Doesn't this reintroduce the previous problem?

Previously, we tried using just the image digest. However, we ran into an issue when loading the image by digest into Kind using kind load docker-image.

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L101-L103

That’s why we decided to pull the Docker image by its digest

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L98-L99

manually create a tag for it

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L104-L105

and then load it into Kind without digest.

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L133-L134

After loading, we can use the tag without any problems.
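The pull-by-digest, re-tag, and load steps described above can be sketched as a dry run (variable names are illustrative, not the actual e2e-common.sh names; swap `run` for direct execution to really pull/tag/load, which requires Docker and a kind cluster):

```shell
IMAGE="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
TAG_ONLY="${IMAGE%%@*}"

run() { echo "+ $*"; }                    # print the command instead of executing it

run docker pull "$IMAGE"                  # 1. pull, pinned by digest
run docker tag "$IMAGE" "$TAG_ONLY"       # 2. re-tag without the digest
run kind load docker-image "$TAG_ONLY"    # 3. load the tag-only reference into kind
```

After step 3, containerd inside the kind node only knows the image under the tag-only name, which is why a `tag@digest` reference in a Pod spec does not match it.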

mbobrovskyi avatar Jun 26 '25 17:06 mbobrovskyi

Do you know why the network issue manifests with the assert failures like these?

To be honest, I'm not sure. But we do occasionally have network issues in Prow, so it might be related to that.

mbobrovskyi avatar Jun 26 '25 17:06 mbobrovskyi

Do you know why the network issue manifests with the assert failures like these?

@mbobrovskyi do you know an answer to that? Again, not a blocker but I would like to know if I'm missing something obvious or we are just guessing based on correlations

mimowo avatar Jun 26 '25 17:06 mimowo

Doesn't this reintroduce the previous problem?

Previously, we tried using just the image digest. However, we ran into an issue when loading the image by digest into Kind using kind load docker-image.

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L101-L103

That’s why we decided to pull the Docker image by its digest

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L98-L99

manually create a tag for it

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L104-L105

and then load it into Kind without digest.

https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L133-L134

After loading, we can use the tag without any problems.

Based on the previous issue, they wanted to run the tests in another cluster instead of a Kind cluster, and that cluster requires the digest. So, once this is merged, the problem will occur again.

the ImageContentSourcePolicy(ICSP) and ClusterImageSet(CIS) are adjusted such as the images required must be referred to digest rather than tag.

tenzen-y avatar Jun 26 '25 18:06 tenzen-y

Can we keep the sha but only strip it when using in kind?

mimowo avatar Jun 26 '25 18:06 mimowo

Can we keep the sha but only strip it when using in kind?

I think we can use environment variables for it and replace the value when we are running on Kind.

mbobrovskyi avatar Jun 26 '25 18:06 mbobrovskyi

Replaced the hardcoded agnhost image with a value read from an env variable, defaulting to the image that was there before. This way, configurations that can pull the image into the cluster will use it, minimizing the flaky behavior, while other configurations keep working as they did before.
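A minimal sketch of that env-override approach (the variable names here are assumptions for illustration, not the exact code): read the agnhost image from an env var, falling back to the pinned default, so Kind runs can export a digest-free reference while other clusters keep the digest-pinned one.

```shell
# ":-" substitutes the default only when the env var is unset or empty.
DEFAULT_AGNHOST_IMAGE="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
AGNHOST_IMAGE="${E2E_TEST_AGNHOST_IMAGE:-$DEFAULT_AGNHOST_IMAGE}"
echo "$AGNHOST_IMAGE"
```

For a Kind run, the caller would export `E2E_TEST_AGNHOST_IMAGE` set to the tag-only reference before starting the tests.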

In addition, I'll open another PR that updates the agnhost image version automatically using Dependabot.

mykysha avatar Jul 02 '25 10:07 mykysha

/hold I want to first understand what is the relationship between re-pulling the image and failures in the specific issues.

/unhold I synced on this with @mbobrovskyi and @mykysha and, IIUC, the reason for the test failures is that a failing image pull (due to a transient network issue) results in the Pod not being able to start, and thus not be running or finished as expected by the tests.

mimowo avatar Jul 02 '25 10:07 mimowo

LGTM label has been added.

Git tree hash: 9473ce87d564c43f360ce2b2ba7ff60073a566f9

k8s-ci-robot avatar Jul 02 '25 10:07 k8s-ci-robot

@mimowo @tenzen-y PTAL

mbobrovskyi avatar Jul 03 '25 08:07 mbobrovskyi

/approve /cherrypick release-0.12 /cherrypick release-0.11

mimowo avatar Jul 03 '25 09:07 mimowo

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.11, release-0.12 in new PRs and assign them to you.

In response to this:

/approve /cherrypick release-0.12 /cherrypick release-0.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mbobrovskyi, mimowo, mykysha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jul 03 '25 09:07 k8s-ci-robot

@mimowo: #5780 failed to apply on top of branch "release-0.12":

Applying: fix: use agnhost images preloaded to the containerd in tests
Using index info to reconstruct a base tree...
M	hack/e2e-common.sh
M	test/e2e/certmanager/metrics_test.go
M	test/e2e/customconfigs/waitforpodsready_test.go
M	test/e2e/multikueue/e2e_test.go
M	test/e2e/singlecluster/fair_sharing_test.go
M	test/e2e/singlecluster/jaxjob_test.go
M	test/e2e/singlecluster/metrics_test.go
A	test/e2e/singlecluster/pytorchjob_test.go
M	test/e2e/singlecluster/visibility_test.go
M	test/e2e/tas/jobset_test.go
M	test/util/e2e.go
Falling back to patching base and 3-way merge...
Auto-merging test/util/e2e.go
CONFLICT (content): Merge conflict in test/util/e2e.go
Auto-merging test/e2e/tas/jobset_test.go
CONFLICT (content): Merge conflict in test/e2e/tas/jobset_test.go
Auto-merging test/e2e/singlecluster/visibility_test.go
CONFLICT (modify/delete): test/e2e/singlecluster/pytorchjob_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/singlecluster/pytorchjob_test.go left in tree.
Auto-merging test/e2e/singlecluster/metrics_test.go
Auto-merging test/e2e/singlecluster/jaxjob_test.go
Auto-merging test/e2e/singlecluster/fair_sharing_test.go
Auto-merging test/e2e/multikueue/e2e_test.go
Auto-merging test/e2e/customconfigs/waitforpodsready_test.go
Auto-merging test/e2e/certmanager/metrics_test.go
Auto-merging hack/e2e-common.sh
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix: use agnhost images preloaded to the containerd in tests

In response to this:

/approve /cherrypick release-0.12 /cherrypick release-0.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mimowo: #5780 failed to apply on top of branch "release-0.11":

Applying: fix: use agnhost images preloaded to the containerd in tests
Using index info to reconstruct a base tree...
M	hack/e2e-common.sh
A	test/e2e/certmanager/metrics_test.go
M	test/e2e/customconfigs/managejobswithoutqueuename_test.go
A	test/e2e/customconfigs/objectretentionpolicies_test.go
A	test/e2e/customconfigs/waitforpodsready_test.go
M	test/e2e/multikueue/e2e_test.go
M	test/e2e/singlecluster/appwrapper_test.go
M	test/e2e/singlecluster/deployment_test.go
M	test/e2e/singlecluster/e2e_test.go
M	test/e2e/singlecluster/fair_sharing_test.go
A	test/e2e/singlecluster/jaxjob_test.go
M	test/e2e/singlecluster/jobset_test.go
M	test/e2e/singlecluster/leaderworkerset_test.go
M	test/e2e/singlecluster/metrics_test.go
M	test/e2e/singlecluster/pod_test.go
A	test/e2e/singlecluster/pytorchjob_test.go
M	test/e2e/singlecluster/statefulset_test.go
M	test/e2e/singlecluster/tas_test.go
M	test/e2e/singlecluster/visibility_test.go
M	test/e2e/tas/appwrapper_test.go
M	test/e2e/tas/job_test.go
M	test/e2e/tas/jobset_test.go
M	test/e2e/tas/leaderworkerset_test.go
M	test/e2e/tas/mpijob_test.go
M	test/e2e/tas/pod_group_test.go
M	test/e2e/tas/pytorch_test.go
M	test/e2e/tas/statefulset_test.go
M	test/integration/singlecluster/controller/jobs/jobset/jobset_controller_test.go
M	test/util/e2e.go
Falling back to patching base and 3-way merge...
Auto-merging test/util/e2e.go
CONFLICT (content): Merge conflict in test/util/e2e.go
Auto-merging test/integration/singlecluster/controller/jobs/jobset/jobset_controller_test.go
Auto-merging test/e2e/tas/statefulset_test.go
Auto-merging test/e2e/tas/pytorch_test.go
Auto-merging test/e2e/tas/pod_group_test.go
Auto-merging test/e2e/tas/mpijob_test.go
Auto-merging test/e2e/tas/leaderworkerset_test.go
Auto-merging test/e2e/tas/jobset_test.go
CONFLICT (content): Merge conflict in test/e2e/tas/jobset_test.go
Auto-merging test/e2e/tas/job_test.go
Auto-merging test/e2e/tas/appwrapper_test.go
Auto-merging test/e2e/singlecluster/visibility_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/visibility_test.go
Auto-merging test/e2e/singlecluster/tas_test.go
Auto-merging test/e2e/singlecluster/statefulset_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/statefulset_test.go
CONFLICT (modify/delete): test/e2e/singlecluster/pytorchjob_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/singlecluster/pytorchjob_test.go left in tree.
Auto-merging test/e2e/singlecluster/pod_test.go
Auto-merging test/e2e/singlecluster/metrics_test.go
Auto-merging test/e2e/singlecluster/leaderworkerset_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/leaderworkerset_test.go
Auto-merging test/e2e/singlecluster/jobset_test.go
CONFLICT (modify/delete): test/e2e/singlecluster/jaxjob_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/singlecluster/jaxjob_test.go left in tree.
Auto-merging test/e2e/singlecluster/fair_sharing_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/fair_sharing_test.go
Auto-merging test/e2e/singlecluster/e2e_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/e2e_test.go
Auto-merging test/e2e/singlecluster/deployment_test.go
Auto-merging test/e2e/singlecluster/appwrapper_test.go
Auto-merging test/e2e/multikueue/e2e_test.go
CONFLICT (modify/delete): test/e2e/customconfigs/waitforpodsready_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/customconfigs/waitforpodsready_test.go left in tree.
CONFLICT (modify/delete): test/e2e/customconfigs/objectretentionpolicies_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/customconfigs/objectretentionpolicies_test.go left in tree.
Auto-merging test/e2e/customconfigs/managejobswithoutqueuename_test.go
CONFLICT (content): Merge conflict in test/e2e/customconfigs/managejobswithoutqueuename_test.go
CONFLICT (modify/delete): test/e2e/certmanager/metrics_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/certmanager/metrics_test.go left in tree.
Auto-merging hack/e2e-common.sh
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix: use agnhost images preloaded to the containerd in tests

In response to this:

/approve /cherrypick release-0.12 /cherrypick release-0.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mykysha please prepare cherrypicks to deflake the release branches

mimowo avatar Jul 03 '25 10:07 mimowo

Replaced the hardcoded agnhost image with a value read from an env variable, defaulting to the image that was there before. This way, configurations that can pull the image into the cluster will use it, minimizing the flaky behavior, while other configurations keep working as they did before.

In addition, I'll open another PR that updates the agnhost image version automatically using Dependabot.

Awesome, thank you!

tenzen-y avatar Jul 03 '25 12:07 tenzen-y