We should investigate why #3093 is necessary

Open awgreene opened this issue 2 years ago • 0 comments

Bug Report

The Image Update test uploads a couple of catalog images to an internal image registry and then uses those images in the test. Recently, the test began failing because of an authentication issue against the internal registry. In the past, the catalogSource pod would hit an authentication error on the image pull but eventually succeed; today the authentication error never resolves. Some notes:

Prior to introducing the changes in #3093, I noticed that the test would pass if you manually deleted the pod after the image pull error.
We believe that the change in behavior might be a byproduct of this commit.

Here's an example of the failing pod's YAML:

apiVersion: v1
kind: Pod
metadata:
  labels:
    olm.catalogSource: catalog-v4gtd
  name: catalog-v4gtd-gwrx8
  namespace: openshift-catsrc-e2e-9lcpt
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: CatalogSource
    name: catalog-v4gtd
    uid: 78b75952-f02e-4866-bb40-c5e9934fa70a
  resourceVersion: "48814"
  uid: b5c8ede2-b534-42b0-919e-2776ce5e045d
spec:
  containers:
  - image: image-registry.openshift-image-registry.svc:5000/openshift-catsrc-e2e-9lcpt/catsrc-update:xhgmp7
    imagePullPolicy: Always
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
      runAsUser: 1000690000
    startupProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-7rbbq
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-0-67-131.ec2.internal
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000690000
    seLinuxOptions:
      level: s0:c26,c20
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: catalog-v4gtd
  serviceAccountName: catalog-v4gtd
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: kube-api-access-7rbbq
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-11-06T21:31:43Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-11-06T21:31:43Z"
    message: 'containers with unready status: [registry-server]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-11-06T21:31:43Z"
    message: 'containers with unready status: [registry-server]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-11-06T21:31:43Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: image-registry.openshift-image-registry.svc:5000/openshift-catsrc-e2e-9lcpt/catsrc-update:xhgmp7
    imageID: ""
    lastState: {}
    name: registry-server
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "image-registry.openshift-image-registry.svc:5000/openshift-catsrc-e2e-9lcpt/catsrc-update:xhgmp7"
        reason: ImagePullBackOff
  hostIP: 10.0.67.131
  phase: Pending
  podIP: 10.128.2.17
  podIPs:
  - ip: 10.128.2.17
  qosClass: Burstable
  startTime: "2023-11-06T21:31:43Z"
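
As a debugging aid, the stuck state in the status above can be detected programmatically before deciding whether to delete the pod by hand, as described in the notes. The following is a minimal sketch, not OLM code: the helper name `stuck_image_pulls` is hypothetical, and it assumes the pod has already been loaded into a plain dict with the same shape as the YAML in this issue.

```python
# Hypothetical helper (not part of OLM): report containers stuck on an image pull.
# Assumes `pod` is a plain dict shaped like the Pod YAML in this issue.

# Waiting reasons the kubelet reports when an image pull fails or is backing off.
PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def stuck_image_pulls(pod):
    """Return (container name, reason, image) for each container waiting on a failed pull."""
    stuck = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        # `state` holds exactly one of waiting/running/terminated; only waiting matters here.
        waiting = cs.get("state", {}).get("waiting") or {}
        reason = waiting.get("reason")
        if reason in PULL_FAILURE_REASONS:
            stuck.append((cs.get("name"), reason, cs.get("image")))
    return stuck


# Trimmed copy of the failing pod from this issue.
pod = {
    "metadata": {
        "name": "catalog-v4gtd-gwrx8",
        "namespace": "openshift-catsrc-e2e-9lcpt",
    },
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {
                "name": "registry-server",
                "image": "image-registry.openshift-image-registry.svc:5000/"
                         "openshift-catsrc-e2e-9lcpt/catsrc-update:xhgmp7",
                "restartCount": 0,
                "state": {"waiting": {"reason": "ImagePullBackOff"}},
            }
        ],
    },
}

for name, reason, image in stuck_image_pulls(pod):
    print(f"{name}: {reason} ({image})")
```

A check like this only confirms that the pull is failing. It does not explain why the registry rejects the credentials, which is the open question in this issue.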

This ticket can be closed once we identify why the pod isn't able to pull from the internal registry.

Nov 07 '23 12:11 awgreene