fix: use agnhost images preloaded to the containerd in tests
What type of PR is this?
/kind failing-test /kind flake
What this PR does / why we need it:
Remove digest from the agnhost images in e2e tests so that they are not re-pulled.
Which issue(s) this PR fixes:
Fixes #5639 Fixes #5640
Special notes for your reviewer:
The agnhost image is already being preloaded in hack/e2e-common.sh:
# agnhost image to use for testing.
export E2E_TEST_AGNHOST_IMAGE_OLD=registry.k8s.io/e2e-test-images/agnhost:2.52@sha256:b173c7d0ffe3d805d49f4dfe48375169b7b8d2e1feb81783efd61eb9d08042e6
E2E_TEST_AGNHOST_IMAGE_OLD_WITHOUT_SHA=${E2E_TEST_AGNHOST_IMAGE_OLD%%@*}
export E2E_TEST_AGNHOST_IMAGE=registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85
E2E_TEST_AGNHOST_IMAGE_WITHOUT_SHA=${E2E_TEST_AGNHOST_IMAGE%%@*}
However, because the preloaded image is loaded into containerd without the sha256 digest, while the tests strictly require an image reference that includes the digest, the image was being pulled again, resulting in flaky behavior whenever a network error occurred.
With the sha256 digests removed from the test image references, the tests can be run without an internet connection after the initial setup, which was not possible before.
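For reference, the `%%@*` parameter expansion in the snippet above simply drops everything from the first `@` onwards, so the stripped reference matches the tag-only name that is preloaded into containerd. A minimal sketch (image value copied from e2e-common.sh):
# Strip the digest: everything from the first '@' is removed.
IMAGE=registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85
echo "${IMAGE%%@*}"   # -> registry.k8s.io/e2e-test-images/agnhost:2.53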
Does this PR introduce a user-facing change?
NONE
Deploy Preview for kubernetes-sigs-kueue ready!
| Name | Link |
|---|---|
| Latest commit | 99374edaf5891e47835aa82705416e6c5664b02c |
| Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68650e1114c7ab0008f4a4ed |
| Deploy Preview | https://deploy-preview-5780--kubernetes-sigs-kueue.netlify.app |
/cc @mbobrovskyi
Awesome! Thank you so much!
/lgtm
LGTM label has been added.
/cc @mimowo
Could you provide some explanation why this would actually fix the issues mentioned?
In particular, why the tests would fail due to re-pulling of the image.
Looking at the failed asserts, it does not seem directly related to image management. For example, the run shows that 67/69 tests passed, so it is not clear to me why the other tests passed while this one failed.
/hold I want to first understand what the relationship is between re-pulling the image and the failures in the specific issues.
In e2e-common.sh, we define agnhost variables using both the tag and the digest:
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L65-L69
Later, we create a tag and load the image using only the tag (without the digest):
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L105
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L133-L134
However, in tests we refer to the image using both tag and digest: https://github.com/kubernetes-sigs/kueue/blob/c08e7b1285e824391175868da59b38177c1abeda/test/util/e2e.go#L57-L60
As a result, the image is pulled again during tests using the digest, even though we already pulled it earlier using just the tag. To avoid this redundant pull, we should consistently use the tag without the digest in tests.
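One way to see the mismatch on a kind node (an illustrative command, not something run in this thread) is to list the images containerd knows about; the preloaded agnhost entry shows up under the tag only, so a reference that also carries the digest still triggers a pull:
# Hypothetical check; the node name matches the kind workers seen in the logs below.
docker exec kind-worker crictl images | grep agnhost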
Could you provide some explanation why this would actually fix the issues mentioned?
In the kubelet logs we found an error:
Jun 12 20:35:17 kind-worker2 kubelet[231]: I0612 20:35:17.902789 231 status_manager.go:911] "Patch status for pod" pod="e2e-grrwf/job-set-replicated-job-1-0-1-clwkq" podUID="26a2c16a-e30c-4b10-bf2b-8d76e937fc4e" patch="{\"metadata\":{\"uid\":\"26a2c16a-e30c-4b10-bf2b-8d76e937fc4e\"},\"status\":{\"containerStatuses\":[{\"image\":\"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\",\"imageID\":\"\",\"lastState\":{},\"name\":\"c\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"message\":\"Back-off pulling image \\\"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": ErrImagePull: failed to pull and unpack image \\\"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error - Server message: unknown: Internal error encountered.\",\"reason\":\"ImagePullBackOff\"}},\"volumeMounts\":[{\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\",\"name\":\"kube-api-access-jl6wk\",\"readOnly\":true,\"recursiveReadOnly\":\"Disabled\"}]}]}}"
Jun 12 20:35:17 kind-worker2 kubelet[231]: I0612 20:35:17.902911 231 status_manager.go:920] "Status for pod updated successfully" pod="e2e-grrwf/job-set-replicated-job-1-0-1-clwkq" statusVersion=5 status={"phase":"Pending","conditions":[{"type":"PodReadyToStartContainers","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:36Z"},{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:34Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:34Z","reason":"ContainersNotReady","message":"containers with unready status: [c]"},{"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:34Z","reason":"ContainersNotReady","message":"containers with unready status: [c]"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:34:34Z"}],"hostIP":"172.18.0.2","hostIPs":[{"ip":"172.18.0.2"}],"podIP":"10.244.2.6","podIPs":[{"ip":"10.244.2.6"}],"startTime":"2025-06-12T20:34:34Z","containerStatuses":[{"name":"c","state":{"waiting":{"reason":"ImagePullBackOff","message":"Back-off pulling image \"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": ErrImagePull: failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error - Server message: unknown: Internal error encountered."}},"lastState":{},"ready":false,"restartCount":0,"image":"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85","imageID":"","started":false,"volumeMounts":[{"name":"kube-api-access-jl6wk","mountPath":"/var/run/secrets/kubernetes.io/serviceaccount","readOnly":true,"recursiveReadOnly":"Disabled"}]}],"qosClass":"Guaranteed"}
Jun 12 20:35:18 kind-worker2 kubelet[231]: I0612 20:35:18.053666 231 projected.go:185] Setting up volume kube-api-access-jl6wk for pod 26a2c16a-e30c-4b10-bf2b-8d76e937fc4e at /var/lib/kubelet/pods/26a2c16a-e30c-4b10-bf2b-8d76e937fc4e/volumes/kubernetes.io~projected/kube-api-access-jl6wk
Jun 12 20:35:18 kind-worker2 kubelet[231]: E0612 20:35:18.515321 231 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered." image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
Jun 12 20:35:18 kind-worker2 kubelet[231]: E0612 20:35:18.515389 231 kuberuntime_image.go:55] "Failed to pull image" err="failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered." image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
Jun 12 20:35:18 kind-worker2 kubelet[231]: I0612 20:35:18.515486 231 kuberuntime_image.go:51] "Pulling image without credentials" image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
Jun 12 20:35:18 kind-worker2 kubelet[231]: E0612 20:35:18.515644 231 kuberuntime_manager.go:1341] "Unhandled Error" err="container &Container{Name:c,Image:registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85,Command:[],Args:[entrypoint-tester],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:JOB_COMPLETION_INDEX,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.labels['batch.kubernetes.io/job-completion-index'],},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{500 -3} {<nil>} 500m DecimalSI},memory: {{200 6} {<nil>} 200M DecimalSI},},Requests:ResourceList{cpu: {{500 -3} {<nil>} 500m DecimalSI},memory: {{200 6} {<nil>} 200M DecimalSI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-dz62k,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod job-set-replicated-job-1-0-0-clmt9_e2e-grrwf(a57da650-219f-454c-8dc7-e64aae90ad2e): ErrImagePull: failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered." logger="UnhandledError"
Jun 12 20:35:18 kind-worker2 kubelet[231]: I0612 20:35:18.515716 231 event.go:389] "Event occurred" object="e2e-grrwf/job-set-replicated-job-1-0-0-clmt9" fieldPath="spec.containers{c}" kind="Pod" apiVersion="v1" type="Warning" reason="Failed" message="Failed to pull image \"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered."
Jun 12 20:35:18 kind-worker2 kubelet[231]: I0612 20:35:18.515743 231 event.go:389] "Event occurred" object="e2e-grrwf/job-set-replicated-job-1-0-0-clmt9" fieldPath="spec.containers{c}" kind="Pod" apiVersion="v1" type="Warning" reason="Failed" message="Error: ErrImagePull"
Jun 12 20:35:18 kind-worker2 kubelet[231]: E0612 20:35:18.517118 231 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"c\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/e2e-test-images/agnhost/manifests/sha256:1c5d47ecd9c4fca235ec0eeb9af0c39d8dd981ae703805a1f23676a9bf47c3bb: 500 Internal Server Error - Server message: unknown: Internal error encountered.\"" pod="e2e-grrwf/job-set-replicated-job-1-0-0-clmt9" podUID="a57da650-219f-454c-8dc7-e64aae90ad2e"
Jun 12 20:35:19 kind-worker2 kubelet[231]: E0612 20:35:19.001971 231 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to resolve reference \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": unexpected status from HEAD request to https://us-central1-docker.pkg.dev/v2/k8s-artifacts-prod/images/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error" image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
That means we're trying to pull the image during the test because it's not available locally with both the tag and digest.
Looking at the failed asserts, it does not seem directly related to image management. For example, the run shows that 67/69 tests passed, so it is not clear to me why the other tests passed while this one failed.
I think it's possible for this to fail due to a network issue, but it shouldn't happen during the tests since we've already pulled the image in e2e-common.sh.
We’re seeing the same issue in another test case (related to LeaderWorkerSet). In the kubelet logs, the same error appears:
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.394295 231 kuberuntime_image.go:51] "Pulling image without credentials" image="registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85"
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.394387 231 event.go:389] "Event occurred" object="lws-e2e-9rbc8/lws-0" fieldPath="spec.containers{c}" kind="Pod" apiVersion="v1" type="Normal" reason="Pulling" message="Pulling image \"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\""
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.404255 231 status_manager.go:911] "Patch status for pod" pod="lws-e2e-9rbc8/lws-0" podUID="687af386-2fd4-477a-b817-f944efe5019b" patch="{\"metadata\":{\"uid\":\"687af386-2fd4-477a-b817-f944efe5019b\"},\"status\":{\"containerStatuses\":[{\"image\":\"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\",\"imageID\":\"\",\"lastState\":{},\"name\":\"c\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"message\":\"Back-off pulling image \\\"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": ErrImagePull: failed to pull and unpack image \\\"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": failed to resolve reference \\\"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\\\": unexpected status from HEAD request to https://us-central1-docker.pkg.dev/v2/k8s-artifacts-prod/images/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error\",\"reason\":\"ImagePullBackOff\"}},\"volumeMounts\":[{\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\",\"name\":\"kube-api-access-5sxhp\",\"readOnly\":true,\"recursiveReadOnly\":\"Disabled\"}]}]}}"
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.404351 231 status_manager.go:920] "Status for pod updated successfully" pod="lws-e2e-9rbc8/lws-0" statusVersion=5 status={"phase":"Pending","conditions":[{"type":"PodReadyToStartContainers","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:55Z"},{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:49Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:49Z","reason":"ContainersNotReady","message":"containers with unready status: [c]"},{"type":"ContainersReady","status":"False","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:49Z","reason":"ContainersNotReady","message":"containers with unready status: [c]"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2025-06-12T20:25:49Z"}],"hostIP":"172.18.0.3","hostIPs":[{"ip":"172.18.0.3"}],"podIP":"10.244.1.7","podIPs":[{"ip":"10.244.1.7"}],"startTime":"2025-06-12T20:25:49Z","containerStatuses":[{"name":"c","state":{"waiting":{"reason":"ImagePullBackOff","message":"Back-off pulling image \"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": ErrImagePull: failed to pull and unpack image \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": failed to resolve reference \"registry.k8s.io/e2e-test-images/agnhost@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85\": unexpected status from HEAD request to https://us-central1-docker.pkg.dev/v2/k8s-artifacts-prod/images/e2e-test-images/agnhost/manifests/sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85: 500 Internal Server Error"}},"lastState":{},"ready":false,"restartCount":0,"image":"registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85","imageID":"","started":false,"volumeMounts":[{"name":"kube-api-access-5sxhp","mountPath":"/var/run/secrets/kubernetes.io/serviceaccount","readOnly":true,"recursiveReadOnly":"Disabled"}]}],"qosClass":"Burstable"}
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470523 231 projected.go:185] Setting up volume kube-api-access-f2bxl for pod 062c5331-7da8-4cd8-82d7-805a87ab2f41 at /var/lib/kubelet/pods/062c5331-7da8-4cd8-82d7-805a87ab2f41/volumes/kubernetes.io~projected/kube-api-access-f2bxl
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470620 231 projected.go:185] Setting up volume kube-api-access-5sxhp for pod 687af386-2fd4-477a-b817-f944efe5019b at /var/lib/kubelet/pods/687af386-2fd4-477a-b817-f944efe5019b/volumes/kubernetes.io~projected/kube-api-access-5sxhp
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470622 231 configmap.go:181] Setting up volume kube-proxy for pod 062c5331-7da8-4cd8-82d7-805a87ab2f41 at /var/lib/kubelet/pods/062c5331-7da8-4cd8-82d7-805a87ab2f41/volumes/kubernetes.io~configmap/kube-proxy
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470683 231 configmap.go:205] Received configMap kube-system/kube-proxy containing (2) pieces of data, 1755 total bytes
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470746 231 quota_linux.go:276] SupportsQuotas called, but quotas disabled
Jun 12 20:26:32 kind-worker kubelet[231]: I0612 20:26:32.470758 231 empty_dir.go:305] assignQuota called, hasQuotas = false userNamespacesEnabled = false
Jun 12 20:26:33 kind-worker kubelet[231]: I0612 20:26:33.001374 231 prober.go:116] "Probe succeeded" probeType="Readiness" pod="default/kuberay-operator-58c5788645-72j6m" podUID="14d17dd4-f0c7-498d-8834-7fa917d02ef0" containerName="kuberay-operator"
But this occurs on another worker node that also doesn’t have the agnhost image. The only issue we can identify is that Kubernetes attempts to pull the image, and something goes wrong on the registry side during that process, resulting in an error. This doesn’t seem to be related to Kueue.
Cool so I see the image is pulled during the test run, but why does it cause the test failure? is it because kueue is restarted just after the new image is pulled?
Cool so I see the image is pulled during the test run, but why does it cause the test failure? is it because kueue is restarted just after the new image is pulled?
No, this is singlecluster and we are not restarting Kueue here. I think it's just a network issue.
Do you know why the network issue manifests with the assert failures like these?
In any case I'm OK to merge it, but I'm still wondering why a network issue while downloading the image results in these assert failures for Kueue.
Doesn't this cause the previous problem again?
Previously, we tried using just the image digest. However, we ran into an issue when loading the image by digest into Kind using kind load docker-image.
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L101-L103
That’s why we decided to pull the Docker image by its digest
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L98-L99
manually create a tag for it
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L104-L105
and then load it into Kind without digest.
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L133-L134
After loading, we can use the tag without any problems.
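Put together, the flow is roughly the following (a sketch based on the linked lines, using the variable names from e2e-common.sh; the real script may differ in details such as the kind cluster name flag):
# Pull by digest, re-tag without the digest, then load the tag-only image into kind.
docker pull "${E2E_TEST_AGNHOST_IMAGE}"
docker tag "${E2E_TEST_AGNHOST_IMAGE}" "${E2E_TEST_AGNHOST_IMAGE_WITHOUT_SHA}"
kind load docker-image "${E2E_TEST_AGNHOST_IMAGE_WITHOUT_SHA}"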
Do you know why the network issue manifests with the assert failures like these?
To be honest, I'm not sure. But we do occasionally have network issues in Prow, so it might be related to that.
Do you know why the network issue manifests with the assert failures like these?
@mbobrovskyi do you know the answer to that? Again, not a blocker, but I would like to know whether I'm missing something obvious or we are just guessing based on correlations.
Doesn't this cause the previous problem again?
Previously, we tried using just the image digest. However, we ran into an issue when loading the image by digest into Kind using kind load docker-image.
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L101-L103
That’s why we decided to pull the Docker image by its digest
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L98-L99
manually create a tag for it
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L104-L105
and then load it into Kind without digest.
https://github.com/kubernetes-sigs/kueue/blob/a9bde54b2fbffb5ba12a1d4cdd89fbb9b33d4a73/hack/e2e-common.sh#L133-L134
After loading, we can use the tag without any problems.
Based on the previous issue, they wanted to run the tests on another cluster instead of a kind cluster, and that cluster requires the digest. So, once this is merged, the problem will occur again.
the ImageContentSourcePolicy (ICSP) and ClusterImageSet (CIS) are adjusted such that the required images must be referenced by digest rather than by tag.
Can we keep the sha but only strip it when running in kind?
Can we keep the sha but only strip it when running in kind?
I think we can use env variables for it and override them when we are using kind.
Replaced the agnhost image usage with reading from an env variable, defaulting to the image that was used before. This way, configurations that are able to pull the image into the cluster will use that image, minimizing the flaky behavior, while other configurations keep working as they did before.
In addition, I'll open another PR that updates the agnhost image version automatically using Dependabot.
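In shell terms, the intended behaviour looks roughly like this (the actual fallback lives in the Go test helpers; apart from the env variable, the names below are illustrative):
# If E2E_TEST_AGNHOST_IMAGE_WITHOUT_SHA is set (e.g. to the tag-only reference for kind runs),
# the tests use it; otherwise they fall back to the digest-pinned reference used before.
DEFAULT_AGNHOST_IMAGE=registry.k8s.io/e2e-test-images/agnhost:2.53@sha256:99c6b4bb4a1e1df3f0b3752168c89358794d02258ebebc26bf21c29399011a85
AGNHOST_IMAGE="${E2E_TEST_AGNHOST_IMAGE_WITHOUT_SHA:-$DEFAULT_AGNHOST_IMAGE}"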
/hold I want to first understand what the relationship is between re-pulling the image and the failures in the specific issues.
/unhold I synced on that with @mbobrovskyi and @mykysha and IIUC the reason for the test failures is that the failing image pull (due to a transient network issue) results in the Pod not being able to start, and thus not running or finishing as expected by the tests.
LGTM label has been added.
@mimowo @tenzen-y PTAL
/approve /cherrypick release-0.12 /cherrypick release-0.11
@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.11, release-0.12 in new PRs and assign them to you.
In response to this:
/approve /cherrypick release-0.12 /cherrypick release-0.11
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mbobrovskyi, mimowo, mykysha
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [mimowo]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@mimowo: #5780 failed to apply on top of branch "release-0.12":
Applying: fix: use agnhost images preloaded to the containerd in tests
Using index info to reconstruct a base tree...
M hack/e2e-common.sh
M test/e2e/certmanager/metrics_test.go
M test/e2e/customconfigs/waitforpodsready_test.go
M test/e2e/multikueue/e2e_test.go
M test/e2e/singlecluster/fair_sharing_test.go
M test/e2e/singlecluster/jaxjob_test.go
M test/e2e/singlecluster/metrics_test.go
A test/e2e/singlecluster/pytorchjob_test.go
M test/e2e/singlecluster/visibility_test.go
M test/e2e/tas/jobset_test.go
M test/util/e2e.go
Falling back to patching base and 3-way merge...
Auto-merging test/util/e2e.go
CONFLICT (content): Merge conflict in test/util/e2e.go
Auto-merging test/e2e/tas/jobset_test.go
CONFLICT (content): Merge conflict in test/e2e/tas/jobset_test.go
Auto-merging test/e2e/singlecluster/visibility_test.go
CONFLICT (modify/delete): test/e2e/singlecluster/pytorchjob_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/singlecluster/pytorchjob_test.go left in tree.
Auto-merging test/e2e/singlecluster/metrics_test.go
Auto-merging test/e2e/singlecluster/jaxjob_test.go
Auto-merging test/e2e/singlecluster/fair_sharing_test.go
Auto-merging test/e2e/multikueue/e2e_test.go
Auto-merging test/e2e/customconfigs/waitforpodsready_test.go
Auto-merging test/e2e/certmanager/metrics_test.go
Auto-merging hack/e2e-common.sh
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix: use agnhost images preloaded to the containerd in tests
In response to this:
/approve /cherrypick release-0.12 /cherrypick release-0.11
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@mimowo: #5780 failed to apply on top of branch "release-0.11":
Applying: fix: use agnhost images preloaded to the containerd in tests
Using index info to reconstruct a base tree...
M hack/e2e-common.sh
A test/e2e/certmanager/metrics_test.go
M test/e2e/customconfigs/managejobswithoutqueuename_test.go
A test/e2e/customconfigs/objectretentionpolicies_test.go
A test/e2e/customconfigs/waitforpodsready_test.go
M test/e2e/multikueue/e2e_test.go
M test/e2e/singlecluster/appwrapper_test.go
M test/e2e/singlecluster/deployment_test.go
M test/e2e/singlecluster/e2e_test.go
M test/e2e/singlecluster/fair_sharing_test.go
A test/e2e/singlecluster/jaxjob_test.go
M test/e2e/singlecluster/jobset_test.go
M test/e2e/singlecluster/leaderworkerset_test.go
M test/e2e/singlecluster/metrics_test.go
M test/e2e/singlecluster/pod_test.go
A test/e2e/singlecluster/pytorchjob_test.go
M test/e2e/singlecluster/statefulset_test.go
M test/e2e/singlecluster/tas_test.go
M test/e2e/singlecluster/visibility_test.go
M test/e2e/tas/appwrapper_test.go
M test/e2e/tas/job_test.go
M test/e2e/tas/jobset_test.go
M test/e2e/tas/leaderworkerset_test.go
M test/e2e/tas/mpijob_test.go
M test/e2e/tas/pod_group_test.go
M test/e2e/tas/pytorch_test.go
M test/e2e/tas/statefulset_test.go
M test/integration/singlecluster/controller/jobs/jobset/jobset_controller_test.go
M test/util/e2e.go
Falling back to patching base and 3-way merge...
Auto-merging test/util/e2e.go
CONFLICT (content): Merge conflict in test/util/e2e.go
Auto-merging test/integration/singlecluster/controller/jobs/jobset/jobset_controller_test.go
Auto-merging test/e2e/tas/statefulset_test.go
Auto-merging test/e2e/tas/pytorch_test.go
Auto-merging test/e2e/tas/pod_group_test.go
Auto-merging test/e2e/tas/mpijob_test.go
Auto-merging test/e2e/tas/leaderworkerset_test.go
Auto-merging test/e2e/tas/jobset_test.go
CONFLICT (content): Merge conflict in test/e2e/tas/jobset_test.go
Auto-merging test/e2e/tas/job_test.go
Auto-merging test/e2e/tas/appwrapper_test.go
Auto-merging test/e2e/singlecluster/visibility_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/visibility_test.go
Auto-merging test/e2e/singlecluster/tas_test.go
Auto-merging test/e2e/singlecluster/statefulset_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/statefulset_test.go
CONFLICT (modify/delete): test/e2e/singlecluster/pytorchjob_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/singlecluster/pytorchjob_test.go left in tree.
Auto-merging test/e2e/singlecluster/pod_test.go
Auto-merging test/e2e/singlecluster/metrics_test.go
Auto-merging test/e2e/singlecluster/leaderworkerset_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/leaderworkerset_test.go
Auto-merging test/e2e/singlecluster/jobset_test.go
CONFLICT (modify/delete): test/e2e/singlecluster/jaxjob_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/singlecluster/jaxjob_test.go left in tree.
Auto-merging test/e2e/singlecluster/fair_sharing_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/fair_sharing_test.go
Auto-merging test/e2e/singlecluster/e2e_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/e2e_test.go
Auto-merging test/e2e/singlecluster/deployment_test.go
Auto-merging test/e2e/singlecluster/appwrapper_test.go
Auto-merging test/e2e/multikueue/e2e_test.go
CONFLICT (modify/delete): test/e2e/customconfigs/waitforpodsready_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/customconfigs/waitforpodsready_test.go left in tree.
CONFLICT (modify/delete): test/e2e/customconfigs/objectretentionpolicies_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/customconfigs/objectretentionpolicies_test.go left in tree.
Auto-merging test/e2e/customconfigs/managejobswithoutqueuename_test.go
CONFLICT (content): Merge conflict in test/e2e/customconfigs/managejobswithoutqueuename_test.go
CONFLICT (modify/delete): test/e2e/certmanager/metrics_test.go deleted in HEAD and modified in fix: use agnhost images preloaded to the containerd in tests. Version fix: use agnhost images preloaded to the containerd in tests of test/e2e/certmanager/metrics_test.go left in tree.
Auto-merging hack/e2e-common.sh
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix: use agnhost images preloaded to the containerd in tests
In response to this:
/approve /cherrypick release-0.12 /cherrypick release-0.11
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@mykysha please prepare cherrypicks to deflake the release branches
Replaced the agnhost image usage with reading from an env variable, defaulting to the image that was used before. This way, configurations that are able to pull the image into the cluster will use that image, minimizing the flaky behavior, while other configurations keep working as they did before.
In addition, I'll open another PR that updates the agnhost image version automatically using Dependabot.
Awesome, thank you!