kubernetes-replicator pod crashes when updating secrets
Describe the bug The kubernetes-replicator pod crashes when replicating a single secret across 150+ namespaces.
To Reproduce Installed using Helm:
helm repo add mittwald https://helm.mittwald.de
helm upgrade --version v2.6.3 --install kubernetes-replicator mittwald/kubernetes-replicator --namespace kubernetes-replicator
Expected behavior The pod should not crash, so that all secrets are replicated to all namespaces.
Environment:
- Kubernetes version: 1.22
- kubernetes-replicator version: 2.6.3
Additional context Additional details about pod termination:
- Increased the pod's resource quotas (see the values sketch below), but the pod still crashed
- Increased the replica count to 3, but all 3 pods crashed
- Updated to the latest replicator version, 2.7.3; the pod still crashed
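For context, a rough sketch of what such overrides could look like when passed to the chart with helm upgrade ... -f values.yaml (the replicaCount and resources keys are assumed from common chart conventions and are not verified against the mittwald/kubernetes-replicator chart; the figures are illustrative):

replicaCount: 3            # the replica-count attempt described above
resources:
  requests:                # illustrative figures only, not the actual values used
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi

The container status recorded after the crash: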
terminated:
  exitCode: 2
  finishedAt: "2022-09-13T21:23:52Z"
  reason: Error
  startedAt: "2022-09-13T21:14:14Z"
name: kubernetes-replicator
ready: false
restartCount: 1
This is the message after the pod restart
kubernetes-replicator-768465d6d7-4mx78:kubernetes-replicator time="2022-09-13T21:24:09Z" level=error msg="could not replicate object to other namespaces" error="Replicated kubernetes-replicator/xyz.com.registry.creds to 70 out of 155 namespaces
There has not been any activity to this issue in the last 14 days. It will automatically be closed after 7 more days. Remove the stale label to prevent this.
Any update is appreciated
Apologies for the delay. Do you have any logs available from when the controller crashed? Those would help isolate the issue and determine whether this is the same issue as #214.
These are the logs from the pod:
level=error msg="could not replicate object to other namespaces" error="Replicated kubernetes-replicator/xxxxxxxx.xxxxx.creds to 157 out of 178 namespaces: 21 errors occurred:\n\t*
and the pod crashes with this error:
lastState:
  terminated:
    containerID: containerd://63cbde6dd2d3a1659ce116e83c9545312d1482024c1548704aab755f3dac6313
    exitCode: 2
    finishedAt: "2022-10-13T20:20:54Z"
    reason: Error
    startedAt: "2022-10-13T20:14:55Z"
After further debugging: the container in the pod is killed due to liveness probe failures. After changing periodSeconds from 10 to 60 on both the liveness and readiness probes, the container is no longer killed and there are no replication failures. That said, the container is killed only while the replicator is replicating the secrets and the probes are failing, so something is blocking the probe requests while secrets are being replicated.
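For reference, a rough sketch of what the adjusted probes look like on the Deployment after that change (the path and port names are placeholders, not values taken from the chart; only periodSeconds reflects the change described here):

livenessProbe:
  httpGet:
    path: /healthz      # placeholder path
    port: http          # placeholder port name
  periodSeconds: 60     # raised from 10
readinessProbe:
  httpGet:
    path: /healthz      # placeholder path
    port: http          # placeholder port name
  periodSeconds: 60     # raised from 10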
There has not been any activity to this issue in the last 14 days. It will automatically be closed after 7 more days. Remove the stale label to prevent this.
I am experiencing this same issue.
https://github.com/mittwald/kubernetes-replicator/blob/f3d5125b0065c0d2ee813437df0967f3d6aca535/liveness/handle.go#L26
It's coming from right here. If the status is not "synced" for all resources, the handler reports unhealthy. That shouldn't be the case for a liveness probe: a sync that has not yet completed doesn't mean the application should report itself as unhealthy.
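To illustrate the pattern being described, here is a minimal sketch, not the actual code behind the link above; the ReplicatorStatus interface, the endpoint path, and the port are hypothetical stand-ins:

package main

import (
	"fmt"
	"net/http"
)

// ReplicatorStatus is a hypothetical view of one replicator's sync state.
type ReplicatorStatus interface {
	Name() string
	Synced() bool
}

// livenessHandler fails the probe while any replicator has not finished
// its sync, which is the behavior being questioned above: a long-running
// replication pass causes the probe to fail and the kubelet to restart
// the container.
func livenessHandler(replicators []ReplicatorStatus) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, rep := range replicators {
			if !rep.Synced() {
				w.WriteHeader(http.StatusServiceUnavailable)
				fmt.Fprintf(w, "%s not synced yet\n", rep.Name())
				return
			}
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "OK")
	}
}

func main() {
	// In the real controller the replicator list would come from the running
	// controllers; it is left empty here only so the sketch compiles.
	http.HandleFunc("/healthz", livenessHandler(nil))
	// Port chosen arbitrarily for the sketch.
	_ = http.ListenAndServe(":9102", nil)
}

Gating liveness on sync completion like this means the container gets killed in the middle of a long replication run, which is consistent with the probe failures described above; a check of this kind seems better suited to a readiness probe, with liveness limited to confirming the process can still serve requests.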
We're still experiencing this issue; is there an update? I noticed the PR check failed while spinning up a KIND cluster. Can it be rerun?