system-upgrade-controller
Spawned jobs use broken (truncated) image names
Version: system-upgrade-controller v0.9.1 (79da9f0)
Platform/Architecture: linux-amd64
Describe the bug: The jobs spawned by SUC use an image name truncated at 64 characters. Kubernetes is therefore unable to run the jobs, since the truncated images simply do not exist.
To Reproduce
We experienced the bug with the following Plan. Please note that I cannot post company-internal image and plan names, so I had to replace them; either way, the images are not publicly available. Please also note that we have configured a registry mirror to avoid having to specify an explicit registry. The key point for reproduction is that the image name must be longer than 64 characters. In our case, the image names are that long because we use a deeply nested structure in GitLab to organize our code base.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: xxxxxx
spec:
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: In
        values:
          - "true"
  upgrade:
    image: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master
    command: ["/usr/local/bin/bash"]
    args: ["/root/install.sh"]
  concurrency: 1
  version: "981a1f90cfcad63ea864957f996cde9b2e99e587"
SUC will pick up this plan and spawn a job to execute it. Kubernetes will then spawn pods accordingly, which fail because the "truncated" image does not exist. I was not able to describe the job or its pods because SUC seems to constantly replace the job. Using the configmap, I was able to enable debug mode though (SYSTEM_UPGRADE_CONTROLLER_DEBUG: true), and the controller then logs the job spec as JSON. This reveals the truncated image name in spec.template.spec.containers[0].image: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/m
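For reference, this is roughly how debug mode was enabled (a sketch; default-controller-env is the ConfigMap name in the stock manifest, adjust the names if your install differs):
# Set the debug key in the controller's ConfigMap and restart the controller
# so the new environment is picked up. Names assume the stock deployment.
kubectl -n system-upgrade patch configmap default-controller-env \
  --type merge -p '{"data":{"SYSTEM_UPGRADE_CONTROLLER_DEBUG":"true"}}'
kubectl -n system-upgrade rollout restart deployment system-upgrade-controller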
Expected behavior
I expect the spawned jobs to use the correct image name. In this example, this would be aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master:981a1f90cfcad63ea864957f996cde9b2e99e587.
Actual behavior
The jobs use an image name truncated at 64 characters: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/m.
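For what it's worth, the observed value is exactly what a naive 64-character cut of the (untagged) image name would produce; an illustration only, not SUC code:
# Cutting the redacted image name from this report at 64 characters yields the
# same string that shows up in the job spec.
IMAGE='aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master'
echo "${IMAGE:0:64}"
# prints: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/m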
Additional context: None, but I'm happy to provide more information if required.
Hm, after some deeper research, the name does not seem to stay truncated. At some point I managed to get one job as YAML via kubectl, and that YAML has the correct image name.
Regardless of this, SUC keeps deleting and creating the same job over and over without giving it time to run. The same job name has a different .metadata.uid every few seconds.
I cannot find an obvious reason why it's doing this. In which cases would SUC do this?
This looks like this issue which I reported a while back:
#164
Yeah, I also saw some similarities. I will try a different image next week with only some "hello world" content to make sure it's really not my image's fault. I still find the behavior rather weird though, even if it was my image failing.
It's not my image's fault, I have the same problem even if it's some bash:latest image which only prints "hello world"
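Roughly the kind of minimal plan used for that test (a sketch with placeholder names, not the exact one from my cluster):
# Minimal "hello world" plan; hello-test and the bash image/tag are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: hello-test
  namespace: system-upgrade
spec:
  concurrency: 1
  version: latest
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: In
        values:
          - "true"
  upgrade:
    image: bash
    command: ["bash", "-c"]
    args: ["echo hello world"]
EOF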
Shoot, this smells like a wrangler bug. Or maybe docker/distribution, oh no.
I am unable to replicate this with rancher/system-upgrade-controller:v0.10.0-dev.2
I mirrored the library/busybox image to a local registry with a very long (97 character) name, and the resulting Job spec is correct:
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: busybox-test
  namespace: system-upgrade
spec:
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values:
          - linux
  upgrade:
    image: local.registry.k3s.khaus/e6b2900c03ccaf3a32b6020cadfdfc46eac4889ddf603ea39d66082504a5e7c3/busybox
    command: ["/bin/sh"]
    args: ["-c", "sleep 60"]
  concurrency: 1
  version: "1.35.0-uclibc"
brandond@dev01:~/suc-test$ kubectl describe job -n system-upgrade -l upgrade.cattle.io/plan=busybox-test
Name: apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e
Namespace: system-upgrade
Selector: controller-uid=d76e37f9-bc02-4270-9a27-c669c80d0cbb
Labels: objectset.rio.cattle.io/hash=6fb02a1a07227c1f395e337028a47ca6942caaa2
plan.upgrade.cattle.io/busybox-test=7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
upgrade.cattle.io/controller=system-upgrade-controller
upgrade.cattle.io/node=k3s-server-1
upgrade.cattle.io/plan=busybox-test
upgrade.cattle.io/version=1.35.0-uclibc
Annotations: batch.kubernetes.io/job-tracking:
objectset.rio.cattle.io/applied:
H4sIAAAAAAAA/+xWW2/bOBP9KwWfJUe2fIkNfA/+YndrtHGMOu2iKIJgRI5srilSS46cGIb/+4KS4sjNpWn3dWHAEC8zPDxzZoZ7liGBAAI22jPQ2hCQNNr5oUn+Qk4OqWWlaXEgUt...
objectset.rio.cattle.io/id: system-upgrade-controller
objectset.rio.cattle.io/owner-gvk: upgrade.cattle.io/v1, Kind=Plan
objectset.rio.cattle.io/owner-name: busybox-test
objectset.rio.cattle.io/owner-namespace: system-upgrade
upgrade.cattle.io/ttl-seconds-after-finished: 900
Parallelism: 1
Completions: 1
Completion Mode: NonIndexed
Start Time: Wed, 13 Jul 2022 13:17:25 -0700
Completed At: Wed, 13 Jul 2022 13:18:29 -0700
Duration: 64s
Active Deadline Seconds: 900s
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=d76e37f9-bc02-4270-9a27-c669c80d0cbb
job-name=apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e
plan.upgrade.cattle.io/busybox-test=7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
upgrade.cattle.io/controller=system-upgrade-controller
upgrade.cattle.io/node=k3s-server-1
upgrade.cattle.io/plan=busybox-test
upgrade.cattle.io/version=1.35.0-uclibc
Containers:
upgrade:
Image: local.registry.k3s.khaus/e6b2900c03ccaf3a32b6020cadfdfc46eac4889ddf603ea39d66082504a5e7c3/busybox:1.35.0-uclibc
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
sleep 60
Environment:
SYSTEM_UPGRADE_NODE_NAME: (v1:spec.nodeName)
SYSTEM_UPGRADE_POD_NAME: (v1:metadata.name)
SYSTEM_UPGRADE_POD_UID: (v1:metadata.uid)
SYSTEM_UPGRADE_PLAN_NAME: busybox-test
SYSTEM_UPGRADE_PLAN_LATEST_HASH: 7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
SYSTEM_UPGRADE_PLAN_LATEST_VERSION: 1.35.0-uclibc
Mounts:
/host from host-root (rw)
/run/system-upgrade/pod from pod-info (ro)
Volumes:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType: Directory
pod-info:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 5m16s job-controller Created pod: apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-fg55r
Normal Completed 4m12s job-controller Job completed
@brandond Can you try with an image whose path is deeper than two segments? That should be supported by default with registry:2, e.g. <registry address>/this/is/one/long/but/legit/name:tag
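(For reference, a vanilla registry:2 accepts arbitrarily deep repository paths out of the box; something like the following is enough to stage such an image, with localhost:5000 standing in for the registry address:)
# Throwaway registry:2 plus a deeply nested repository path (all names are placeholders).
docker run -d --name registry -p 5000:5000 registry:2
docker pull busybox:latest
docker tag busybox:latest localhost:5000/this/is/one/long/but/legit/name:tag
docker push localhost:5000/this/is/one/long/but/legit/name:tag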
I suspect a bug in either rancher/wrangler or distribution/distribution, but it is possible that my initial use of the distribution functions to adjust the tag enforces the two-segment path assumption and mangles everything else.
brandond@dev01:~/suc-test$ skopeo sync --src docker --dest docker docker.io/library/busybox:1.35.0-uclibc local.registry.k3s.khaus/$(openssl rand -hex 16)/$(openssl rand -hex 16)/$(openssl rand -hex 16)
INFO[0000] Tag presence check imagename="docker.io/library/busybox:1.35.0-uclibc" tagged=true
INFO[0000] Copying image ref 1/1 from="docker://busybox:1.35.0-uclibc" to="docker://local.registry.k3s.khaus/7f19c02113f03500a5c8d72ce1a44965/d45508d8fdaa912cb352a97d14b7af1d/483e6eb76483325ea686ace1485b4b31/busybox:1.35.0-uclibc"
Seems to be stuck in a loop, and the jobs never get created:
time="2022-07-13T22:00:33Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:00:33Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2022-07-13T22:00:33Z LastTransitionTime: Reason:Version Message:}] LatestVersion:1.35.0-uclibc LatestHash:7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1 Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:00:33Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e for system-upgrade-controller system-upgrade/busybox-test" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:00:33Z" level=error msg="error syncing 'system-upgrade/busybox-test': handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:00:33Z" level=error msg="error syncing 'system-upgrade/apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e': handler system-upgrade-controller: jobs.batch \"apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e\" not found, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:03Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:03Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2022-07-13T22:01:03Z LastTransitionTime: Reason:Version Message:}] LatestVersion:1.35.0-uclibc LatestHash:7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1 Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:03Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e for system-upgrade-controller system-upgrade/busybox-test" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:03Z" level=error msg="error syncing 'system-upgrade/busybox-test': handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:33Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:33Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2022-07-13T22:01:33Z LastTransitionTime: Reason:Version Message:}] LatestVersion:1.35.0-uclibc LatestHash:7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1 Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:34Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e for system-upgrade-controller system-upgrade/busybox-test" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:34Z" level=error msg="error syncing 'system-upgrade/busybox-test': handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
The not found errors for both the plan and job resources seem to indicate that some of the informer caches aren't getting started properly? If I delete the SUC pod, the next controller pod that comes up creates the job successfully:
brandond@dev01:~/suc-test$ kubectl describe job -n system-upgrade -l upgrade.cattle.io/plan=busybox-test
Name: apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e
Namespace: system-upgrade
Selector: controller-uid=267c240d-ccdb-4901-9c99-927fc4bec3cb
Labels: objectset.rio.cattle.io/hash=6fb02a1a07227c1f395e337028a47ca6942caaa2
plan.upgrade.cattle.io/busybox-test=7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
upgrade.cattle.io/controller=system-upgrade-controller
upgrade.cattle.io/node=k3s-server-1
upgrade.cattle.io/plan=busybox-test
upgrade.cattle.io/version=1.35.0-uclibc
Annotations: batch.kubernetes.io/job-tracking:
objectset.rio.cattle.io/applied:
H4sIAAAAAAAA/+xWW2/bOBP9KwWfJUe+yI4NfA/+YndrtHGMOu2iKIJgRI5srilSS46cGIb/+4KS4sjNpWn3dWHAEC8zPDxzZoZ7liGBAAI22jPQ2hCQNNr5oUn+Qk4OqWWlaXEgUt...
objectset.rio.cattle.io/id: system-upgrade-controller
objectset.rio.cattle.io/owner-gvk: upgrade.cattle.io/v1, Kind=Plan
objectset.rio.cattle.io/owner-name: busybox-test
objectset.rio.cattle.io/owner-namespace: system-upgrade
upgrade.cattle.io/ttl-seconds-after-finished: 900
Parallelism: 1
Completions: 1
Completion Mode: NonIndexed
Start Time: Wed, 13 Jul 2022 15:04:21 -0700
Completed At: Wed, 13 Jul 2022 15:05:25 -0700
Duration: 64s
Active Deadline Seconds: 900s
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=267c240d-ccdb-4901-9c99-927fc4bec3cb
job-name=apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e
plan.upgrade.cattle.io/busybox-test=7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
upgrade.cattle.io/controller=system-upgrade-controller
upgrade.cattle.io/node=k3s-server-1
upgrade.cattle.io/plan=busybox-test
upgrade.cattle.io/version=1.35.0-uclibc
Containers:
upgrade:
Image: local.registry.k3s.khaus/7f19c02113f03500a5c8d72ce1a44965/d45508d8fdaa912cb352a97d14b7af1d/483e6eb76483325ea686ace1485b4b31/busybox:1.35.0-uclibc
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
sleep 60
Environment:
SYSTEM_UPGRADE_NODE_NAME: (v1:spec.nodeName)
SYSTEM_UPGRADE_POD_NAME: (v1:metadata.name)
SYSTEM_UPGRADE_POD_UID: (v1:metadata.uid)
SYSTEM_UPGRADE_PLAN_NAME: busybox-test
SYSTEM_UPGRADE_PLAN_LATEST_HASH: 7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
SYSTEM_UPGRADE_PLAN_LATEST_VERSION: 1.35.0-uclibc
Mounts:
/host from host-root (rw)
/run/system-upgrade/pod from pod-info (ro)
Volumes:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType: Directory
pod-info:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 107s job-controller Created pod: apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-829jw
Normal Completed 43s job-controller Job completed
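(The workaround of deleting the SUC pod amounts to something like this, assuming the stock deployment name:)
# Restart the controller; the replacement pod created the job successfully here.
kubectl -n system-upgrade rollout restart deployment system-upgrade-controller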
@brandond Could you please clarify whether the issue was reproduced or not? It's not clear from the comment above. Thanks!
No, I was not able to reproduce any sort of image name truncation.
There does appear to be a different issue that causes the initial run of the controller to throw errors about not finding resources that do exist in the cluster, but that is not the issue that is being discussed here.
Okay, thanks for confirming. That matches my observations of not reproducing the issue with system-upgrade-controller v0.9.1:
When defining a plan with an image whose path is deeper than two segments, both the job and the pod spawned to execute it are defined with the full image name. It is not truncated.
Here is the image I used: quay.io/janek/eee1da0de3b/00eaad3b/00ea1d192/0e323de3b/busybox:latest
Edit: I also tried with the image from the OP, with the same result: the name is not truncated.
quay.io/janek/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master:981a1f90cfcad63ea864957f996cde9b2e99e587
@webD97 Could you provide the full content of the plan that causes this issue for you, along with the version of Kubernetes you are using?
cannot reproduce