cartographer icon indicating copy to clipboard operation
cartographer copied to clipboard

Delayed runnable refresh.

Open sfzylad opened this issue 2 years ago • 2 comments

Bug description:

We recently found out that it takes a while for runnable to figure out the underlying object is completed. For instance:

  conditions:
  - lastTransitionTime: "2022-09-29T22:13:10Z"
    message: ""
    reason: Ready
    status: "True"
    type: RunTemplateReady
  - lastTransitionTime: "2022-09-30T02:23:56Z"
    message: ""
    reason: SucceededCondition
    status: "True"
    type: StampedObjectCondition
  - lastTransitionTime: "2022-09-30T02:23:56Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready

It took 4+ hours to transition from RunTemplateReady to SucceededCondition. The underlaying object was Tekton PipelineRun and it was succeeded in 26 seconds:

  conditions:
  - lastTransitionTime: "2022-09-29T22:13:36Z"
    message: 'Tasks Completed: 1 (Failed: 0, Cancelled 0), Skipped: 0'
    reason: Succeeded
    status: "True"
    type: Succeeded

Steps to reproduce:

Unfortunately I cannot have any clear path to reproduction.

Expected behavior:

Just after the PipelineRun gets its Succeeded condition set to True, Runnable should set its Ready condition to True.

Actual behavior:

Workload that stamped the Runnable was waiting for the output from Runanble and the Runnable didn't refresh its state for over 4 hours although the underlying PipelineRun resource completed almost instantaneously

Logs, k8s object dumps:

apiVersion: carto.run/v1alpha1
kind: Runnable
metadata:
  creationTimestamp: "2022-09-29T22:13:08Z"
  generation: 1
  labels:
    carto.run/cluster-template-name: sign-image-template-verify-dzyla-v0-0-3
    carto.run/resource-name: sign-image
    carto.run/supply-chain-name: supply-chain-verify-dzyla-v0-0-3
    carto.run/template-kind: ClusterImageTemplate
    carto.run/workload-name: foo-bar
    carto.run/workload-namespace: default
  name: 2b50086e-340b-41b6-8b53-e1ed40b761ed-store
  namespace: default
  ownerReferences:
  - apiVersion: carto.run/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Workload
    name: foo-bar
    uid: 2b50086e-340b-41b6-8b53-e1ed40b761ed
  resourceVersion: "107440535"
  uid: e759ce93-1f33-4c84-94e0-83418ef75401
spec:
  inputs:
    image: example.com:tag
    url: example.com@sha:digest
  retentionPolicy:
    maxFailedRuns: 10
    maxSuccessfulRuns: 10
  runTemplateRef:
    name: sign-image-tekton-pipelinerun-verify-dzyla-v0-0-3
  serviceAccountName: verify-sa
status:
  conditions:
  - lastTransitionTime: "2022-09-29T22:13:10Z"
    message: ""
    reason: Ready
    status: "True"
    type: RunTemplateReady
  - lastTransitionTime: "2022-09-30T02:23:56Z"
    message: ""
    reason: SucceededCondition
    status: "True"
    type: StampedObjectCondition
  - lastTransitionTime: "2022-09-30T02:23:56Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  observedGeneration: 1
  outputs:
    # outputs here

Versions:

  • Infrastructure (kind, TKG, GKE etc.)
  • cartographer version: v0.5.0
  • k8s version: v1.21.6
  • tekton-pipelines v0.40.1

Deployment info:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  creationTimestamp: "2022-05-09T17:32:18Z"
  generation: 6
  labels:
    app.kubernetes.io/component: cartographer
    app.kubernetes.io/name: cartographer-controller
    app.kubernetes.io/version: v0.5.0
  name: cartographer-controller
  namespace: cartographer-system
  resourceVersion: "107201119"
  uid: 6214fba2-0089-46c3-95c8-4bc18ae9106f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: cartographer-controller
      kapp.k14s.io/app: "1652117534801234013"
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%                                                                                                                                                                                                                                                                                                                                                         [0/9023]
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cartographer-controller
        kapp.k14s.io/app: "1652117534801234013"
        kapp.k14s.io/association: v1.c55856bac76106a4a463f38db29588aa
    spec:
      containers:
      - args:
        - -cert-dir=/cert
        image: example.com/cache/projectcartographer/cartographer@sha256:c6e5bca54b39ec595f00f5646f809c9929322a603958805a2d826fd3cb2ee1f3
        imagePullPolicy: IfNotPresent
        name: cartographer-controller
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 500m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /cert
          name: cert
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: cartographer-controller
      serviceAccountName: cartographer-controller
      terminationGracePeriodSeconds: 30
      volumes:
      - name: cert
        secret:
          defaultMode: 420
          secretName: cartographer-webhook
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-05-09T17:32:18Z"
    lastUpdateTime: "2022-08-25T18:15:56Z"
    message: ReplicaSet "cartographer-controller-6dcfd6c4c" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-08-31T20:31:17Z"
    lastUpdateTime: "2022-08-31T20:31:17Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 6
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

The cartographer was deployed as imgpkg bundle using a Carvel App CRD.

Additional context:

It happens randomly so it's hard to reproduce.

sfzylad avatar Sep 30 '22 16:09 sfzylad

Thanks for bringing this to our attention.

Can you share how many other Runnables / Workloads / Deliverables running on this cluster?

Also, were there any Pod restarts on the Cartographer controller?

idoru avatar Sep 30 '22 17:09 idoru

Can you share how many other Runnables / Workloads / Deliverables running on this cluster?

This cluster is rather small because it's not production, however here's the numbers: workloads: 18 runnables: 23 deliverables: none

Note these data is from now, however it shouldn't differ much if at all from the time the problem occurred.

Also, were there any Pod restarts on the Cartographer controller?

The pod seems to run just fine:

NAME                                      READY   STATUS    RESTARTS   AGE
cartographer-controller-6dcfd6c4c-6v8f9   1/1     Running   0          29d

sfzylad avatar Sep 30 '22 19:09 sfzylad