system-upgrade-controller

Spawned jobs use broken (truncated) image names

Open webD97 opened this issue 3 years ago • 14 comments

Version system-upgrade-controller version v0.9.1 (79da9f0)

Platform/Architecture linux-amd64

Describe the bug The jobs spawned by SUC use an image name that is truncated at 64 characters. As a result, Kubernetes cannot run the jobs, since the truncated images simply do not exist.

To Reproduce We experienced the bug with the following Plan. Please note that I cannot post company-internal image and plan names, so I have replaced them; either way, the images are not publicly available. Please also note that we have configured a registry mirror to avoid having to specify an explicit registry. The key point for reproduction is that the image name must be longer than 64 characters. In our case, the image names are that long because we use a deeply nested group structure in GitLab to organize our code base.

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: xxxxxx
spec:
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: In
        values:
          - "true"
  upgrade:
    image: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master
    command: ["/usr/local/bin/bash"]
    args: ["/root/install.sh"]
  concurrency: 1
  version: "981a1f90cfcad63ea864957f996cde9b2e99e587"

SUC will pick up this plan and spawn a job to execute it. Kubernetes will then spawn pods accordingly, which fail because the "truncated" image does not exist. I was not able to describe the job or its pods because SUC seems to constantly replace the job. Using the configmap, I was able to enable debug mode though (SYSTEM_UPGRADE_CONTROLLER_DEBUG: true), and the controller then logs the job spec as JSON. This reveals the truncated image name in spec.template.spec.containers[0].image: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/m
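
For reference, one way to capture the image that actually lands in the job spec, even while the controller keeps replacing the job, is to watch the jobs directly; the namespace and plan label value below are assumptions based on this report:

# Print the rendered container image for each (re)created job as it appears.
kubectl get jobs -n system-upgrade -l upgrade.cattle.io/plan=xxxxxx --watch \
  -o 'custom-columns=NAME:.metadata.name,UID:.metadata.uid,IMAGE:.spec.template.spec.containers[0].image'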

Expected behavior I expect the spawned jobs to use the correct image name. In this example, this would be aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master:981a1f90cfcad63ea864957f996cde9b2e99e587.

Actual behavior The jobs use an image name truncated at 64 characters: aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/m.
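
As a quick sanity check, counting the characters of the two names quoted above (strings copied from this report) matches a cut at 64 characters:

printf '%s' 'aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master' | wc -c   # full name, prints 69
printf '%s' 'aa/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/m' | wc -c        # truncated name, prints 64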

Additional context None - but I'm happy to provide more information if required.

webD97 avatar May 17 '22 08:05 webD97

Hm, after some deeper research the name does not seem to stay truncated. I eventually managed to get one of the jobs as YAML via kubectl, and that YAML has the correct image name.

webD97 avatar May 17 '22 12:05 webD97

Regardless of this, SUC keeps deleting and creating the same job over and over without giving it time to run. The same job name has a different .metadata.uid every few seconds.

I cannot find an obvious reason why it's doing this. In which cases would SUC do this?

webD97 avatar May 17 '22 12:05 webD97

This looks like the issue I reported a while back:

#164

ntx-ben avatar May 19 '22 14:05 ntx-ben

Yeah, I also saw some similarities. I will try a different image next week with only some "hello world" content to make sure it's really not my image's fault. I still find the behavior rather weird though, even if it was my image failing.

webD97 avatar May 19 '22 14:05 webD97

It's not my image's fault: I have the same problem even with a bash:latest image which only prints "hello world".
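
For anyone who wants to try the same thing, a minimal "hello world" plan might look like the sketch below; the plan name, namespace, and node selector are illustrative assumptions, not values from this thread:

# Minimal sketch of a trivial plan. SUC appends .spec.version as the image tag,
# so image "bash" plus version "latest" resolves to bash:latest.
kubectl apply -f - <<'EOF'
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: hello-world-test
  namespace: system-upgrade
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values:
          - linux
  version: latest
  upgrade:
    image: bash
    command: ["bash", "-c"]
    args: ["echo hello world"]
EOF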

webD97 avatar May 28 '22 12:05 webD97

Shoot, this smells like a wrangler bug. Or maybe docker/distribution, oh no.

dweomer avatar Jun 03 '22 05:06 dweomer

I am unable to replicate this with rancher/system-upgrade-controller:v0.10.0-dev.2

I mirrored the library/busybox image to a local registry with a very long (97 character) name, and the resulting Job spec is correct:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: busybox-test
  namespace: system-upgrade
spec:
  nodeSelector:
    matchExpressions:
    - key: kubernetes.io/os
      operator: In
      values:
      - linux
  upgrade:
    image: local.registry.k3s.khaus/e6b2900c03ccaf3a32b6020cadfdfc46eac4889ddf603ea39d66082504a5e7c3/busybox
    command: ["/bin/sh"]
    args: ["-c", "sleep 60"]
  concurrency: 1
  version: "1.35.0-uclibc"
brandond@dev01:~/suc-test$ kubectl describe job -n system-upgrade -l upgrade.cattle.io/plan=busybox-test
Name:                     apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e
Namespace:                system-upgrade
Selector:                 controller-uid=d76e37f9-bc02-4270-9a27-c669c80d0cbb
Labels:                   objectset.rio.cattle.io/hash=6fb02a1a07227c1f395e337028a47ca6942caaa2
                          plan.upgrade.cattle.io/busybox-test=7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
                          upgrade.cattle.io/controller=system-upgrade-controller
                          upgrade.cattle.io/node=k3s-server-1
                          upgrade.cattle.io/plan=busybox-test
                          upgrade.cattle.io/version=1.35.0-uclibc
Annotations:              batch.kubernetes.io/job-tracking:
                          objectset.rio.cattle.io/applied:
                            H4sIAAAAAAAA/+xWW2/bOBP9KwWfJUe2fIkNfA/+YndrtHGMOu2iKIJgRI5srilSS46cGIb/+4KS4sjNpWn3dWHAEC8zPDxzZoZ7liGBAAI22jPQ2hCQNNr5oUn+Qk4OqWWlaXEgUt...
                          objectset.rio.cattle.io/id: system-upgrade-controller
                          objectset.rio.cattle.io/owner-gvk: upgrade.cattle.io/v1, Kind=Plan
                          objectset.rio.cattle.io/owner-name: busybox-test
                          objectset.rio.cattle.io/owner-namespace: system-upgrade
                          upgrade.cattle.io/ttl-seconds-after-finished: 900
Parallelism:              1
Completions:              1
Completion Mode:          NonIndexed
Start Time:               Wed, 13 Jul 2022 13:17:25 -0700
Completed At:             Wed, 13 Jul 2022 13:18:29 -0700
Duration:                 64s
Active Deadline Seconds:  900s
Pods Statuses:            0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=d76e37f9-bc02-4270-9a27-c669c80d0cbb
           job-name=apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e
           plan.upgrade.cattle.io/busybox-test=7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
           upgrade.cattle.io/controller=system-upgrade-controller
           upgrade.cattle.io/node=k3s-server-1
           upgrade.cattle.io/plan=busybox-test
           upgrade.cattle.io/version=1.35.0-uclibc
  Containers:
   upgrade:
    Image:      local.registry.k3s.khaus/e6b2900c03ccaf3a32b6020cadfdfc46eac4889ddf603ea39d66082504a5e7c3/busybox:1.35.0-uclibc
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      sleep 60
    Environment:
      SYSTEM_UPGRADE_NODE_NAME:             (v1:spec.nodeName)
      SYSTEM_UPGRADE_POD_NAME:              (v1:metadata.name)
      SYSTEM_UPGRADE_POD_UID:               (v1:metadata.uid)
      SYSTEM_UPGRADE_PLAN_NAME:            busybox-test
      SYSTEM_UPGRADE_PLAN_LATEST_HASH:     7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
      SYSTEM_UPGRADE_PLAN_LATEST_VERSION:  1.35.0-uclibc
    Mounts:
      /host from host-root (rw)
      /run/system-upgrade/pod from pod-info (ro)
  Volumes:
   host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory
   pod-info:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
Events:
  Type    Reason            Age    From            Message
  ----    ------            ----   ----            -------
  Normal  SuccessfulCreate  5m16s  job-controller  Created pod: apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-fg55r
  Normal  Completed         4m12s  job-controller  Job completed

brandond avatar Jul 13 '22 20:07 brandond

@brandond Can you try with an image with a path deeper than two? Should be supported by default with registry:2, e.g. <registry address>/this/is/one/long/but/legit/name:tag
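
For context, a throwaway registry:2 instance accepts such deep repository paths out of the box; the port and names below are placeholders, not values from this thread:

# Run a disposable local registry and push an image with a deeply nested path.
docker run -d --rm -p 5000:5000 --name test-registry registry:2
docker pull busybox:latest
docker tag busybox:latest localhost:5000/this/is/one/long/but/legit/name:latest
docker push localhost:5000/this/is/one/long/but/legit/name:latest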

dweomer avatar Jul 13 '22 21:07 dweomer

@brandond Can you try with an image with a path deeper than two? Should be supported by default with registry:2, e.g. <registry address>/this/is/one/long/but/legit/name:tag

I suspect a bug in either rancher/wrangler or distribution/distribution, but it is possible that my initial use of the distribution functions to adjust the tag is enforcing the two-segment path assumption and mangling everything else.

dweomer avatar Jul 13 '22 21:07 dweomer

brandond@dev01:~/suc-test$ skopeo sync --src docker --dest docker docker.io/library/busybox:1.35.0-uclibc local.registry.k3s.khaus/$(openssl rand -hex 16)/$(openssl rand -hex 16)/$(openssl rand -hex 16)
INFO[0000] Tag presence check                            imagename="docker.io/library/busybox:1.35.0-uclibc" tagged=true
INFO[0000] Copying image ref 1/1                         from="docker://busybox:1.35.0-uclibc" to="docker://local.registry.k3s.khaus/7f19c02113f03500a5c8d72ce1a44965/d45508d8fdaa912cb352a97d14b7af1d/483e6eb76483325ea686ace1485b4b31/busybox:1.35.0-uclibc"

It seems to be stuck in a loop and the jobs never get created:

time="2022-07-13T22:00:33Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:00:33Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2022-07-13T22:00:33Z LastTransitionTime: Reason:Version Message:}] LatestVersion:1.35.0-uclibc LatestHash:7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1 Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:00:33Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e for system-upgrade-controller system-upgrade/busybox-test" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:00:33Z" level=error msg="error syncing 'system-upgrade/busybox-test': handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:00:33Z" level=error msg="error syncing 'system-upgrade/apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e': handler system-upgrade-controller: jobs.batch \"apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e\" not found, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:03Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:03Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2022-07-13T22:01:03Z LastTransitionTime: Reason:Version Message:}] LatestVersion:1.35.0-uclibc LatestHash:7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1 Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:03Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e for system-upgrade-controller system-upgrade/busybox-test" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:03Z" level=error msg="error syncing 'system-upgrade/busybox-test': handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:33Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[] LatestVersion: LatestHash: Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:33Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/busybox-test@658, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2022-07-13T22:01:33Z LastTransitionTime: Reason:Version Message:}] LatestVersion:1.35.0-uclibc LatestHash:7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1 Applying:[]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:34Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e for system-upgrade-controller system-upgrade/busybox-test" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"
time="2022-07-13T22:01:34Z" level=error msg="error syncing 'system-upgrade/busybox-test': handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, handler system-upgrade-controller: plans.upgrade.cattle.io \"busybox-test\" not found, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:314"

The not found errors for both the plan and job resources seem to indicate that some of the informer caches aren't getting started properly? If I delete the SUC pod, the next controller pod that comes up creates the job successfully:

brandond@dev01:~/suc-test$ kubectl describe job -n system-upgrade -l upgrade.cattle.io/plan=busybox-test
Name:                     apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e
Namespace:                system-upgrade
Selector:                 controller-uid=267c240d-ccdb-4901-9c99-927fc4bec3cb
Labels:                   objectset.rio.cattle.io/hash=6fb02a1a07227c1f395e337028a47ca6942caaa2
                          plan.upgrade.cattle.io/busybox-test=7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
                          upgrade.cattle.io/controller=system-upgrade-controller
                          upgrade.cattle.io/node=k3s-server-1
                          upgrade.cattle.io/plan=busybox-test
                          upgrade.cattle.io/version=1.35.0-uclibc
Annotations:              batch.kubernetes.io/job-tracking:
                          objectset.rio.cattle.io/applied:
                            H4sIAAAAAAAA/+xWW2/bOBP9KwWfJUe+yI4NfA/+YndrtHGMOu2iKIJgRI5srilSS46cGIb/+4KS4sjNpWn3dWHAEC8zPDxzZoZ7liGBAAI22jPQ2hCQNNr5oUn+Qk4OqWWlaXEgUt...
                          objectset.rio.cattle.io/id: system-upgrade-controller
                          objectset.rio.cattle.io/owner-gvk: upgrade.cattle.io/v1, Kind=Plan
                          objectset.rio.cattle.io/owner-name: busybox-test
                          objectset.rio.cattle.io/owner-namespace: system-upgrade
                          upgrade.cattle.io/ttl-seconds-after-finished: 900
Parallelism:              1
Completions:              1
Completion Mode:          NonIndexed
Start Time:               Wed, 13 Jul 2022 15:04:21 -0700
Completed At:             Wed, 13 Jul 2022 15:05:25 -0700
Duration:                 64s
Active Deadline Seconds:  900s
Pods Statuses:            0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=267c240d-ccdb-4901-9c99-927fc4bec3cb
           job-name=apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-3655e
           plan.upgrade.cattle.io/busybox-test=7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
           upgrade.cattle.io/controller=system-upgrade-controller
           upgrade.cattle.io/node=k3s-server-1
           upgrade.cattle.io/plan=busybox-test
           upgrade.cattle.io/version=1.35.0-uclibc
  Containers:
   upgrade:
    Image:      local.registry.k3s.khaus/7f19c02113f03500a5c8d72ce1a44965/d45508d8fdaa912cb352a97d14b7af1d/483e6eb76483325ea686ace1485b4b31/busybox:1.35.0-uclibc
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      sleep 60
    Environment:
      SYSTEM_UPGRADE_NODE_NAME:             (v1:spec.nodeName)
      SYSTEM_UPGRADE_POD_NAME:              (v1:metadata.name)
      SYSTEM_UPGRADE_POD_UID:               (v1:metadata.uid)
      SYSTEM_UPGRADE_PLAN_NAME:            busybox-test
      SYSTEM_UPGRADE_PLAN_LATEST_HASH:     7dfd8044aa7c456209bd9a936377d75caf7b42dde345cfd6535981d1
      SYSTEM_UPGRADE_PLAN_LATEST_VERSION:  1.35.0-uclibc
    Mounts:
      /host from host-root (rw)
      /run/system-upgrade/pod from pod-info (ro)
  Volumes:
   host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory
   pod-info:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  107s  job-controller  Created pod: apply-busybox-test-on-k3s-server-1-with-7dfd8044aa7c45620-829jw
  Normal  Completed         43s   job-controller  Job completed
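
For reference, restarting the controller can also be done with a rollout restart instead of deleting the pod by hand; the deployment name and namespace below are assumed from the standard manifest:

kubectl -n system-upgrade rollout restart deployment/system-upgrade-controller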

brandond avatar Jul 13 '22 22:07 brandond

@brandond Could you please clarify whether the issue was reproduced or not? It's not clear from the comment above. Thanks!

janeczku avatar Jul 19 '22 18:07 janeczku

No, I was not able to reproduce any sort of image name truncation.

There does appear to be a different issue that causes the initial run of the controller to throw errors about not finding resources that do exist in the cluster, but that is not the issue that is being discussed here.

brandond avatar Jul 19 '22 18:07 brandond

Okay, thanks for confirming. That matches my observations; I was also unable to reproduce the issue with system-upgrade-controller v0.9.1:

When defining a plan with an image that has a path deeper than two segments, both the job and the pod spawned to execute the job are defined with the full image name. It's not truncated.

Here is the image I used: quay.io/janek/eee1da0de3b/00eaad3b/00ea1d192/0e323de3b/busybox:latest

Edit: I also tried the image from the OP, with the same result: it is not truncated.

quay.io/janek/bbbb/ccccccccccccccccc/dddddddddddddddddd/eeeeeeeeee/xxxxxx/master:981a1f90cfcad63ea864957f996cde9b2e99e587
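
A one-liner along these lines can be used to confirm the rendered image in the job spec (the namespace and plan label are assumptions based on this thread):

kubectl get job -n system-upgrade -l upgrade.cattle.io/plan=<plan-name> \
  -o jsonpath='{.items[*].spec.template.spec.containers[0].image}'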

janeczku avatar Jul 19 '22 18:07 janeczku

@webD97 Could you provide the full content of the plan that causes this issue for you, along with the version of Kubernetes you are using?

janeczku avatar Jul 19 '22 18:07 janeczku

cannot reproduce

samjustus avatar Aug 22 '22 16:08 samjustus