cartographer
Delayed runnable refresh.
Bug description:
We recently found that it can take a while for a Runnable to figure out that the underlying object has completed. For instance:
conditions:
- lastTransitionTime: "2022-09-29T22:13:10Z"
  message: ""
  reason: Ready
  status: "True"
  type: RunTemplateReady
- lastTransitionTime: "2022-09-30T02:23:56Z"
  message: ""
  reason: SucceededCondition
  status: "True"
  type: StampedObjectCondition
- lastTransitionTime: "2022-09-30T02:23:56Z"
  message: ""
  reason: Ready
  status: "True"
  type: Ready
It took more than 4 hours to transition from RunTemplateReady to SucceededCondition. The underlying object was a Tekton PipelineRun, which succeeded in 26 seconds:
conditions:
- lastTransitionTime: "2022-09-29T22:13:36Z"
  message: 'Tasks Completed: 1 (Failed: 0, Cancelled 0), Skipped: 0'
  reason: Succeeded
  status: "True"
  type: Succeeded
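For reference, the gap between the PipelineRun's Succeeded transition and the Runnable's StampedObjectCondition transition works out to roughly 4 hours 10 minutes. A minimal Go sketch of the arithmetic, using only the two timestamps copied from the dumps above:

package main

import (
	"fmt"
	"time"
)

func main() {
	// lastTransitionTime of the PipelineRun's Succeeded condition
	pipelineRunSucceeded, _ := time.Parse(time.RFC3339, "2022-09-29T22:13:36Z")
	// lastTransitionTime of the Runnable's StampedObjectCondition
	runnableRefreshed, _ := time.Parse(time.RFC3339, "2022-09-30T02:23:56Z")

	fmt.Println(runnableRefreshed.Sub(pipelineRunSucceeded)) // prints 4h10m20s
}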
Steps to reproduce:
Unfortunately I don't have a clear path to reproduction.
Expected behavior:
Just after the PipelineRun gets its Succeeded condition set to True, the Runnable should set its Ready condition to True.
Actual behavior:
The Workload that stamped the Runnable was waiting for the output from the Runnable, and the Runnable didn't refresh its state for over 4 hours, even though the underlying PipelineRun resource completed almost instantaneously.
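This pattern, where the Runnable does eventually refresh but only hours later, is what you would expect if the status update came from a periodic resync rather than an event-driven watch on the stamped object. As a general illustration only (not Cartographer's actual wiring; the reconciler type and both import paths below are assumptions), a controller-runtime setup that immediately enqueues the owning Runnable whenever a stamped PipelineRun changes looks roughly like this:

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	// assumed import paths, for illustration only
	tektonv1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	cartov1alpha1 "github.com/vmware-tanzu/cartographer/pkg/apis/v1alpha1"
)

// RunnableReconciler is a placeholder used only to show the watch wiring;
// it is not Cartographer's actual reconciler.
type RunnableReconciler struct {
	client.Client
}

func (r *RunnableReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// A real reconciler would re-read the Runnable here and refresh its
	// conditions and outputs from the stamped object's current status.
	return ctrl.Result{}, nil
}

// SetupWithManager registers an event-driven watch: whenever a stamped
// PipelineRun changes (e.g. its Succeeded condition flips to True), the
// owning Runnable is enqueued right away instead of waiting for a resync.
func (r *RunnableReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&cartov1alpha1.Runnable{}).
		Watches(
			&source.Kind{Type: &tektonv1beta1.PipelineRun{}},
			&handler.EnqueueRequestForOwner{
				OwnerType:    &cartov1alpha1.Runnable{},
				IsController: true,
			},
		).
		Complete(r)
}

If a watch like this is in place but events are missed or the informer cache goes stale, the same delayed-refresh symptom can appear.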
Logs, k8s object dumps:
apiVersion: carto.run/v1alpha1
kind: Runnable
metadata:
  creationTimestamp: "2022-09-29T22:13:08Z"
  generation: 1
  labels:
    carto.run/cluster-template-name: sign-image-template-verify-dzyla-v0-0-3
    carto.run/resource-name: sign-image
    carto.run/supply-chain-name: supply-chain-verify-dzyla-v0-0-3
    carto.run/template-kind: ClusterImageTemplate
    carto.run/workload-name: foo-bar
    carto.run/workload-namespace: default
  name: 2b50086e-340b-41b6-8b53-e1ed40b761ed-store
  namespace: default
  ownerReferences:
  - apiVersion: carto.run/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Workload
    name: foo-bar
    uid: 2b50086e-340b-41b6-8b53-e1ed40b761ed
  resourceVersion: "107440535"
  uid: e759ce93-1f33-4c84-94e0-83418ef75401
spec:
  inputs:
    image: example.com:tag
    url: example.com@sha:digest
  retentionPolicy:
    maxFailedRuns: 10
    maxSuccessfulRuns: 10
  runTemplateRef:
    name: sign-image-tekton-pipelinerun-verify-dzyla-v0-0-3
  serviceAccountName: verify-sa
status:
  conditions:
  - lastTransitionTime: "2022-09-29T22:13:10Z"
    message: ""
    reason: Ready
    status: "True"
    type: RunTemplateReady
  - lastTransitionTime: "2022-09-30T02:23:56Z"
    message: ""
    reason: SucceededCondition
    status: "True"
    type: StampedObjectCondition
  - lastTransitionTime: "2022-09-30T02:23:56Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  observedGeneration: 1
  outputs:
    # outputs here
Versions:
- Infrastructure (kind, TKG, GKE etc.)
- cartographer version: v0.5.0
- k8s version: v1.21.6
- tekton-pipelines v0.40.1
Deployment info:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  creationTimestamp: "2022-05-09T17:32:18Z"
  generation: 6
  labels:
    app.kubernetes.io/component: cartographer
    app.kubernetes.io/name: cartographer-controller
    app.kubernetes.io/version: v0.5.0
  name: cartographer-controller
  namespace: cartographer-system
  resourceVersion: "107201119"
  uid: 6214fba2-0089-46c3-95c8-4bc18ae9106f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: cartographer-controller
      kapp.k14s.io/app: "1652117534801234013"
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cartographer-controller
        kapp.k14s.io/app: "1652117534801234013"
        kapp.k14s.io/association: v1.c55856bac76106a4a463f38db29588aa
    spec:
      containers:
      - args:
        - -cert-dir=/cert
        image: example.com/cache/projectcartographer/cartographer@sha256:c6e5bca54b39ec595f00f5646f809c9929322a603958805a2d826fd3cb2ee1f3
        imagePullPolicy: IfNotPresent
        name: cartographer-controller
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 500m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /cert
          name: cert
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: cartographer-controller
      serviceAccountName: cartographer-controller
      terminationGracePeriodSeconds: 30
      volumes:
      - name: cert
        secret:
          defaultMode: 420
          secretName: cartographer-webhook
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-05-09T17:32:18Z"
    lastUpdateTime: "2022-08-25T18:15:56Z"
    message: ReplicaSet "cartographer-controller-6dcfd6c4c" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-08-31T20:31:17Z"
    lastUpdateTime: "2022-08-31T20:31:17Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 6
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
Cartographer was deployed as an imgpkg bundle using a Carvel App CRD.
Additional context:
It happens randomly, so it's hard to reproduce.
Thanks for bringing this to our attention.
Can you share how many other Runnables / Workloads / Deliverables are running on this cluster?
Also, were there any Pod restarts on the Cartographer controller?
Can you share how many other Runnables / Workloads / Deliverables are running on this cluster?
This cluster is rather small because it's not production, but here are the numbers: workloads: 18, runnables: 23, deliverables: none.
Note that these numbers are from now, but they shouldn't differ much, if at all, from when the problem occurred.
Also, were there any Pod restarts on the Cartographer controller?
The pod seems to run just fine:
NAME READY STATUS RESTARTS AGE
cartographer-controller-6dcfd6c4c-6v8f9 1/1 Running 0 29d