
[BUG] reflector does not update secrets anymore after some time - restart fixes the problem

Open aeimer opened this issue 3 years ago • 9 comments

Hi guys,

we are happily using Reflector in our clusters, but after some time the pod keeps running while secrets are no longer updated or created. When I restart the deployment, the new pod instantly creates or updates the target secrets, so it does not seem to be a memory leak (memory graph screenshot omitted), although there is some kind of pattern in the memory stats.

I can't check the logs very far back, but there were no log entries for at least 30 days before the restart.

We have encountered this problem several times on multiple clusters. Is there any way we can further debug this issue?

Best regards Alex

aeimer avatar Jun 27 '22 07:06 aeimer

Hi,

I am encountering the same issue as @aeimer on my cluster which is running v6.1.23.

When the issue occurs, all logging stops and the process no longer appears to be doing anything. Restarting the deployment or deleting the stuck pod fixes the issue once the Reflector service comes back up. On my last freeze, the last few log lines looked like this:

2022-07-02 06:41:09.338 +00:00 [INF] () Starting host
2022-07-02 06:41:11.618 +00:00 [INF] (ES.Kubernetes.Reflector.Core.NamespaceWatcher) Requesting V1Namespace resources
2022-07-02 06:41:11.764 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretWatcher) Requesting V1Secret resources
2022-07-02 06:41:11.830 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Requesting V1ConfigMap resources
2022-07-02 06:41:13.622 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Auto-reflected cert-manager/xx-xxxx-net-lego-cert where permitted. Created 0 - Updated 5 - Deleted 0 - Validated 5.

And then nothing else in the logs... just stuck there.

After I restarted the deployment, the log output was exactly the same except that no validation was done: Created 0 - Updated 5 - Deleted 0 - Validated 0.

Here are logs after a restart and a successful patch of the auto-reflected resources:

2022-07-21 06:41:09.338 +00:00 [INF] () Starting host
2022-07-21 06:41:11.618 +00:00 [INF] (ES.Kubernetes.Reflector.Core.NamespaceWatcher) Requesting V1Namespace resources
2022-07-21 06:41:11.764 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretWatcher) Requesting V1Secret resources
2022-07-21 06:41:11.830 +00:00 [INF] (ES.Kubernetes.Reflector.Core.ConfigMapWatcher) Requesting V1ConfigMap resources
2022-07-21 06:41:13.622 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Auto-reflected cert-manager/hs-x-net-lego-cert where permitted. Created 0 - Updated 5 - Deleted 0 - Validated 0.
2022-07-21 06:41:14.514 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Patched default/hs-x-net-lego-cert as a reflection of cert-manager/hs-x-net-lego-cert
2022-07-21 06:41:14.575 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Patched cluster-service/hs-x-net-lego-cert as a reflection of cert-manager/hs-x-net-lego-cert
2022-07-21 06:41:14.624 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Patched home-automation/hs-x-net-lego-cert as a reflection of cert-manager/hs-x-net-lego-cert
2022-07-21 06:41:14.667 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Patched home-media/hs-x-net-lego-cert as a reflection of cert-manager/hs-x-net-lego-cert
2022-07-21 06:41:14.738 +00:00 [INF] (ES.Kubernetes.Reflector.Core.SecretMirror) Patched ingress/hs-x-net-lego-cert as a reflection of cert-manager/hs-x-net-lego-cert

So perhaps something in Core.SecretMirror is blocking?

JoshuaDodds avatar Jul 21 '22 06:07 JoshuaDodds

I'm facing the same issue.

I believe it is somehow due to broken watchers (maybe because an idle connection was dropped by a firewall, for example?).

For me, the secrets got synced again after the watcher session was internally closed and restarted, so setting configuration.watcher.timeout to a low value like 60 or 120 seconds seems to be a workaround for this issue.
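For anyone who wants to try that workaround, a minimal Helm values sketch might look like the one below. The key path is copied from the comment above (configuration.watcher.timeout) and is an assumption about the chart's schema, so verify it against the values.yaml of the chart version you actually run.

```yaml
# Hypothetical values.yaml override for the reflector Helm chart.
# The key path mirrors the comment above; the exact name may differ
# between chart versions, so check the chart's own values.yaml first.
configuration:
  watcher:
    timeout: 120   # seconds; forces the watch connections to be re-established periodically
```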

RolandOtta avatar Jul 22 '22 08:07 RolandOtta

The issue is indeed the watchers. I'm working on a new solution that monitors the watchers and restarts them if they're dead. The fallback would be a timeout, but the problem with a watcher timeout is that some clusters have thousands of reflected secrets, and the watchers may time out before everything is processed, resulting in performance degradation.

winromulus avatar Jul 22 '22 08:07 winromulus

Let me know if you need help testing a beta version... i dont have thousands of secrets on my cluster though. Just a handful that get reflected to 3 other namespaces.

JoshuaDodds avatar Jul 22 '22 09:07 JoshuaDodds

Automatically marked as stale due to no recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 13 '22 09:08 stale[bot]

Calling attention to this bug again. Please do not close it. I am happy to help if needed.

JoshuaDodds avatar Aug 13 '22 11:08 JoshuaDodds

Removed stale label.

stale[bot] avatar Aug 13 '22 11:08 stale[bot]

Just following up, as this problem is causing us quite a headache as well. We're running into the issue on every cluster we use Reflector on, regardless of the watcher timeout length (we have tried 120, 60, and 30). We're copying fewer than 100 secrets per cluster.

Any updates on progress or expected timeline?

Sorry to be annoying, and thanks for the good work!

henkejosh avatar Aug 18 '22 00:08 henkejosh

Sadly, same issue here. We are reflecting literally one secret across only 7 namespaces, so at least we know it is not load dependent.

Tomasz-Kluczkowski avatar Sep 21 '22 09:09 Tomasz-Kluczkowski

Could we add something like a liveness probe that automatically restarts the Reflector pod if it gets stuck?

pbxqdown avatar Oct 02 '22 18:10 pbxqdown

That would probably be a good quick fix if the problem can't be fixed soon.

aeimer avatar Oct 04 '22 15:10 aeimer

Hi.

It would be nice if we were able to override the livenessProbe through the Helm chart.

So I submitted a PR for this purpose: https://github.com/emberstack/kubernetes-reflector/pull/310
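To illustrate the idea, here is a hypothetical livenessProbe for the Reflector container, e.g. as a value the chart could pass through once such a PR is in. The /healthz path and port 25080 are assumptions about the image's health endpoint rather than confirmed values, and a plain HTTP probe only helps with this bug if the endpoint actually reports dead watchers as unhealthy.

```yaml
# Hypothetical probe definition; path and port are placeholders that must be
# checked against the Reflector image. Tune the thresholds to taste.
livenessProbe:
  httpGet:
    path: /healthz
    port: 25080
  initialDelaySeconds: 30
  periodSeconds: 60
  failureThreshold: 3
```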

idrissneumann avatar Oct 13 '22 14:10 idrissneumann

A bit more information not found in previous messages:

  • happens on 6.1.47 (the latest release at the time of writing)
  • last happened when I updated the master nodes (i.e. when the API server was restarted, likely breaking the watcher)

From experience with Go services, the Go client offers a RetryWatcher. Is there something similar in the C# SDK?

f-ld avatar Oct 19 '22 14:10 f-ld

For anyone looking for a solution to this, it might be worth using https://github.com/ktsstudio/mirrors until a fix is pushed.

Rid avatar Nov 02 '22 12:11 Rid

I published a workaround based on a CronJob here: https://github.com/emberstack/kubernetes-reflector/pull/310
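As a rough illustration of that kind of workaround (not necessarily what the PR itself contains), a restart CronJob could look like the sketch below. The names, namespace, image and schedule are placeholders, and the ServiceAccount needs RBAC permission to patch the Deployment (Role/RoleBinding omitted for brevity).

```yaml
# Hypothetical stop-gap: restart the Reflector Deployment on a schedule.
# All names and the namespace are placeholders; the ServiceAccount must be
# allowed to get/patch Deployments (RBAC objects not shown here).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: reflector-restarter
  namespace: kube-system
spec:
  schedule: "*/10 * * * *"   # every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: reflector-restarter
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - --namespace
                - kube-system
                - rollout
                - restart
                - deployment/reflector
```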

But the suggestion below seems like an even better alternative, if you have the time to replace Reflector with something else:

> For anyone looking for a solution to this, it might be worth using https://github.com/ktsstudio/mirrors until a fix is pushed.

idrissneumann avatar Nov 09 '22 10:11 idrissneumann

> I published a workaround based on a CronJob here: #310

@idrissneumann Thanks for the workaround! That's essentially what we've been using as a stopgap: a CronJob that restarts the pod every 10 minutes. It's obviously very inelegant, but it has been working without issue for the past few weeks.

> But the suggestion below seems like an even better alternative, if you have the time to replace Reflector with something else:
>
> > For anyone looking for a solution to this, it might be worth using https://github.com/ktsstudio/mirrors until a fix is pushed.

I'm curious about your thoughts, and @Rid's, on why this mirrors solution might be a stronger alternative. From a quick glance at the repo, it doesn't seem to have any active work or maintenance (the last commit was a README change 3 months ago), which is the major issue with this kubernetes-reflector repo.

Have you used mirrors successfully, do you know the owner, or do you have other reasons for recommending it? I'm certainly open to it, but wary of adopting something new unless it has a strong maintainer behind it.

henkejosh avatar Nov 15 '22 22:11 henkejosh

It might be worth checking whether the External Secrets Operator can handle the same task. You can define a ClusterSecretStore with a Kubernetes source, and it might be possible to configure the source and target cluster as the same one. Not tested yet, but I will...

https://external-secrets.io/v0.6.1/provider/kubernetes/
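For anyone who wants to experiment with that idea, a rough sketch is shown below. It is assembled from memory of the ESO docs linked above, so treat every field as an assumption and double-check it against that page; all names, namespaces and secret names are placeholders.

```yaml
# Hypothetical in-cluster mirroring with External Secrets Operator's
# kubernetes provider. All names/namespaces are placeholders and the exact
# schema should be verified against the ESO documentation linked above.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: in-cluster
spec:
  provider:
    kubernetes:
      remoteNamespace: cert-manager          # namespace holding the source secret
      server:
        caProvider:
          type: ConfigMap
          name: kube-root-ca.crt
          key: ca.crt
          namespace: default
      auth:
        serviceAccount:
          name: eso-secret-reader            # needs RBAC to read secrets in cert-manager
          namespace: external-secrets
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: hs-x-net-lego-cert
  namespace: ingress                         # one of the target namespaces
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: in-cluster
  target:
    name: hs-x-net-lego-cert                 # name of the mirrored secret
  dataFrom:
    - extract:
        key: hs-x-net-lego-cert              # name of the source secret
```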

dada-engineer avatar Nov 16 '22 07:11 dada-engineer

Just wanted to echo the above - we're having the same issue with Reflector.

marktlc avatar Feb 23 '23 22:02 marktlc

Please try the new version; this issue should be fixed. Reopen if it is still a problem (some scenarios are extremely hard to reproduce, and help is required to validate the fix).

winromulus avatar Mar 05 '23 01:03 winromulus

Hi @winromulus

Thanks for the update here. I've updated our version of Reflector to the latest. I waited approximately 2 hours before testing and the issue still appears to occur. I followed the steps below to reproduce the issue.

  1. Updated Reflector.
  2. Approximately 2 hours later, updated a secret.
  3. The secret did not replicate; nothing was shown in the logs.
  4. Restarted Reflector; the secret then replicated successfully immediately.

The logs don't really offer anything of value, so I haven't included them, but let us know if we can provide any more diagnostic info.

Thanks, Mark

marktlc avatar Mar 05 '23 23:03 marktlc