
Bundle unpack Pod gets continuously re-created when the unpack binary fails

Open · timflannagan opened this issue 3 years ago · 8 comments

The Bundle controller continuously re-creates the unpack Pod when the unpack binary fails to unpack the Bundle's spec.image container image. It's easy to reproduce this behavior: build an invalid plain+v0 bundle container image, run the provisioner on a local cluster through make run KIND_CLUSTER_NAME=kind, and create the following Bundle and BundleDeployment resources:

apiVersion: core.rukpak.io/v1alpha1
kind: Bundle
metadata:
  name: olm-v0.20.0
spec:
  image: quay.io/tflannag/olm-plain-bundle:olm-v0.20.0-invalid
  provisionerClassName: core.rukpak.io/plain-v0

---

apiVersion: core.rukpak.io/v1alpha1
kind: BundleDeployment
metadata:
  name: olm-v0.20.0
spec:
  bundleName: olm-v0.20.0
  provisionerClassName: core.rukpak.io/plain-v0

Then wait until the Bundle reports an indefinite Pending phase:

k get bundles olm-v0.20.0 -o yaml
apiVersion: core.rukpak.io/v1alpha1
kind: Bundle
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"core.rukpak.io/v1alpha1","kind":"Bundle","metadata":{"annotations":{},"name":"olm-v0.20.0"},"spec":{"image":"quay.io/tflannag/olm-plain-bundle:olm-v0.20.0-combined","provisionerClassName":"core.rukpak.io/plain-v0"}}
  creationTimestamp: "2022-03-08T17:16:30Z"
  generation: 1
  name: olm-v0.20.0
  resourceVersion: "5796"
  uid: dcf76eaa-eb45-4e68-b791-47c8fea92c09
spec:
  image: quay.io/tflannag/olm-plain-bundle:olm-v0.20.0-combined
  provisionerClassName: core.rukpak.io/plain-v0
status:
  conditions:
  - lastTransitionTime: "2022-03-08T17:16:30Z"
    message: ""
    reason: UnpackPending
    status: "False"
    type: Unpacked
  observedGeneration: 1
  phase: Pending

Watching this unpack Pod locally, you can see it is being recycled quite a bit:

$ k get pods plain-v0-unpack-bundle-olm-v0.20.0 -w
NAME                                 READY   STATUS            RESTARTS   AGE
plain-v0-unpack-bundle-olm-v0.20.0   0/1     PodInitializing   0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Error             0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Terminating       0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Terminating       0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Pending           0          0s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Pending           0          0s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Init:0/1          0          1s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     PodInitializing   0          2s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Error             0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Terminating       0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Terminating       0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Pending           0          0s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Pending           0          0s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Init:0/1          0          1s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     PodInitializing   0          2s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Error             0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Terminating       0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Terminating       0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Pending           0          1s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Pending           0          1s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Init:0/1          0          1s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     PodInitializing   0          2s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Error             0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Terminating       0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Terminating       0          4s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Pending           0          0s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Pending           0          1s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     Init:0/1          0          1s
plain-v0-unpack-bundle-olm-v0.20.0   0/1     PodInitializing   0          2s
...

The plain-v0 provisioner logs are the only place we expose this failure state:

1.6467603047930653e+09	DEBUG	controller.bundle	ending reconciliation	{"reconciler group": "core.rukpak.io", "reconciler kind": "Bundle", "name": "olm-v0.20.0", "namespace": ""}
1.6467603047931132e+09	ERROR	controller.bundle	Reconciler error	{"reconciler group": "core.rukpak.io", "reconciler kind": "Bundle", "name": "olm-v0.20.0", "namespace": "", "error": "unpack failed: 2022/03/08 17:25:03 walk bundle dir \"/manifests\": stat /manifests/.: not a directory\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/tflannag/Documents/operator-framework/rukpak/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/tflannag/Documents/operator-framework/rukpak/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
1.6467603047932153e+09	DEBUG	controller.bundle	starting reconciliation	{"reconciler group": "core.rukpak.io", "reconciler kind": "Bundle", "name": "olm-v0.20.0", "namespace": ""}
1.6467603048062544e+09	INFO	controller.bundle	applying status changes	{"reconciler group": "core.rukpak.io", "reconciler kind": "Bundle", "name": "olm-v0.20.0", "namespace": ""}
...

This behavior differs from the normal state transitions I'd expect for a container that fails within a Pod managed by a Deployment/ReplicaSet, where the container gets restarted by the higher-level controller and the eventual state is CrashLoopBackOff or Error, which is easier to debug.
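
For comparison, here is a minimal sketch (not what the provisioner does today; the names and image are placeholders) of an unpack Pod spec where the restarts stay inside a single Pod: with restartPolicy: OnFailure, the kubelet restarts the failing container in place and the Pod eventually reports CrashLoopBackOff instead of being deleted and recreated:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// unpackPodWithRestart is a sketch of an unpack Pod whose failing container is
// restarted in place by the kubelet (eventually surfacing CrashLoopBackOff)
// rather than the Pod being deleted and recreated by the controller.
func unpackPodWithRestart() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "plain-v0-unpack-bundle-olm-v0.20.0",
			Namespace: "rukpak-system",
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyOnFailure, // kubelet-level restarts with backoff
			Containers: []corev1.Container{{
				Name:  "unpack",
				Image: "quay.io/tflannag/olm-plain-bundle:olm-v0.20.0-invalid",
			}},
		},
	}
}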

timflannagan · Mar 08 '22 17:03

I think this can be solved with static validation tooling, but it would be nice to expose failure states in the Bundle/BundleDeployment statuses, or even fire off events, to make debugging invalid Bundle sources (e.g. a poorly formatted container image) an easier experience.
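
As a rough sketch of that second idea (this is not existing rukpak code; the recorder wiring, condition accessor, and reason strings are assumptions), surfacing the failure as a Warning event plus an Unpacked=False condition could look like:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// reportUnpackFailure records a Warning event against the Bundle and flips the
// Unpacked condition to False with the unpack error as the message. The
// EventRecorder and condition slice are assumed to be wired into the
// reconciler; this is illustrative only.
func reportUnpackFailure(recorder record.EventRecorder, bundle runtime.Object, conditions *[]metav1.Condition, generation int64, unpackErr error) {
	recorder.Eventf(bundle, corev1.EventTypeWarning, "UnpackFailed", "unpack failed: %v", unpackErr)
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Unpacked",
		Status:             metav1.ConditionFalse,
		Reason:             "UnpackFailed",
		Message:            unpackErr.Error(),
		ObservedGeneration: generation,
	})
}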

timflannagan · Mar 08 '22 17:03

I can't remember for sure, but I thought the provisioner logic for recreating the failed pod was subject to exponential backoff? That seems reasonable to me. But I agree that pod status and/or logs in that case should be propagated to the Bundle status.
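
As a sketch of what that propagation could look like (not existing code; the message format is an assumption), the terminated container state from the failed unpack Pod could be copied into the Bundle's condition message:

package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// unpackFailureMessage extracts a human-readable failure message from a failed
// unpack Pod's container statuses, suitable for propagating into the Bundle's
// Unpacked condition. It returns an empty string if no container has
// terminated with a non-zero exit code.
func unpackFailureMessage(pod *corev1.Pod) string {
	for _, cs := range pod.Status.ContainerStatuses {
		term := cs.State.Terminated
		if term != nil && term.ExitCode != 0 {
			return fmt.Sprintf("unpack container %q failed (exit code %d, reason %s): %s",
				cs.Name, term.ExitCode, term.Reason, term.Message)
		}
	}
	return ""
}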

joelanford · Mar 08 '22 18:03

Exploring this. Out of curiosity, how is the original bundle malformed?

nsapse · Mar 25 '22 19:03

@nsapse Basically the /manifests path in the container image was a file and not a directory like the provisioner was expecting.
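
For reference, the error in the provisioner logs comes from walking /manifests when it isn't a directory. A sketch of the kind of up-front validation that would turn this into a clearer error (not the provisioner's actual code; the function and parameter names are made up):

package sketch

import (
	"fmt"
	"os"
)

// checkManifestsDir verifies that the bundle's manifests path is a directory
// before walking it, so a bundle image that ships /manifests as a regular file
// produces a friendlier error than the raw "not a directory" stat failure.
func checkManifestsDir(manifestsDir string) error {
	info, err := os.Stat(manifestsDir)
	if err != nil {
		return fmt.Errorf("stat bundle dir %q: %w", manifestsDir, err)
	}
	if !info.IsDir() {
		return fmt.Errorf("bundle dir %q must be a directory, but it is a regular file", manifestsDir)
	}
	return nil
}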

timflannagan · Mar 25 '22 19:03

Okay, I played around with this super briefly while watching the Bruins: it looks like this is reproducible when the generated unpack Pod references a volume mount that doesn't exist (e.g. a ConfigMap volume source whose corev1.ObjectReference isn't present on the cluster), and as a result, the unpack Pod is stuck in the Pending phase until the kubelet gives up waiting for that volume to be present before starting the containers in the unpack Pod.

After adding some minor debug logs, we're getting back a controllerutil.OperationResultUpdated result during r.ensureUnpackPod, which implies that the desired and existing Pods aren't semantically equal:

https://github.com/operator-framework/rukpak/blob/d868cf228aecc4bcd1ff1d1a3b32ac4177ca6717/internal/provisioner/plain/controllers/bundle_controller.go#L158-L161

Adding a debug log that prints the cmp.Diff between the existing and desired client.Object, it appears the issue is due to defaulting of the various volume source fields:

  	TypeMeta:   {Kind: "Pod", APIVersion: "v1"},
  	ObjectMeta: {Name: "plain-unpack-bundle-olm-t7j46", Namespace: "rukpak-system", UID: "c8cb55a7-d1f4-4878-b6b2-215ba5d2af50", ResourceVersion: "37758", ...},
  	Spec: v1.PodSpec{
  		Volumes: []v1.Volume{
  			{Name: "util", VolumeSource: {EmptyDir: &{}}},
  			{
  				Name: "test",
  				VolumeSource: v1.VolumeSource{
  					... // 16 identical fields
  					FC:        nil,
  					AzureFile: nil,
  					ConfigMap: &v1.ConfigMapVolumeSource{
  						LocalObjectReference: {Name: "non-existent"},
  						Items:                nil,
- 						DefaultMode:          &420,
+ 						DefaultMode:          nil,
  						Optional:             nil,
  					},
  					VsphereVolume: nil,
  					Quobyte:       nil,
  					... // 8 identical fields
  				},
  			},
  		},

Which might imply that we need a comment in the r.ensureUnpackPod(...) method so reviewers/authors that touch that method understand that behavior.
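
Alternatively (just a sketch of one option; the volume and ConfigMap names are placeholders), setting the server-side default explicitly on the desired object keeps the comparison stable, e.g. building the ConfigMap volume with DefaultMode pre-set to the defaulted value seen in the diff:

package sketch

import corev1 "k8s.io/api/core/v1"

// configMapVolume builds a ConfigMap-backed volume with DefaultMode set
// explicitly to the API server's default (0644, i.e. 420 decimal), so the
// desired and existing Pod specs compare as semantically equal and
// CreateOrUpdate stops reporting OperationResultUpdated on every reconcile.
func configMapVolume(volumeName, configMapName string) corev1.Volume {
	defaultMode := int32(420)
	return corev1.Volume{
		Name: volumeName,
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: configMapName},
				DefaultMode:          &defaultMode,
			},
		},
	}
}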

timflannagan · May 11 '22 00:05

Randomly thinking about this more: is this also an issue for mutating webhooks that inject custom Pod annotations/labels/etc., where our CreateOrRecreate logic doesn't account for that edge case?

timflannagan · Sep 02 '22 22:09

This issue has become stale because it has been open 60 days with no activity. The maintainers of this repo will remove this label during issue triage or it will be removed automatically after an update. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.

github-actions[bot] · Nov 02 '22 00:11

I think the problem here is that when something goes wrong with the pod, we delete the pod, error out, and then start again with a fresh pod, thus restarting the exponential backoff. This goes on forever because we generally reconcile successfully several times per cycle after creating the fresh pod.

Maybe a Job would help here because it would abstract us from the underlying pod recycling?
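
For illustration only (the names, namespace, image handling, and backoff limit below are placeholders, not rukpak's implementation), wrapping the unpack container in a Job with a bounded backoffLimit would hand retries, backoff, and terminal failure over to the Job controller:

package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// unpackJob wraps the unpack container in a Job so that pod recycling, retry
// backoff, and a terminal Failed condition are owned by the Job controller
// instead of the provisioner deleting and recreating pods itself.
func unpackJob(bundleName, image string) *batchv1.Job {
	backoffLimit := int32(4) // give up after a handful of failed attempts
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "plain-unpack-" + bundleName,
			Namespace: "rukpak-system",
		},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // let the Job create retry pods
					Containers: []corev1.Container{{
						Name:  "unpack",
						Image: image,
					}},
				},
			},
		},
	}
}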

joelanford · Jul 05 '23 12:07