argo-helm

[argo-cd] Switch to bitnami/redis and bitnami/redis-cluster chart

Open aslafy-z opened this issue 2 years ago • 6 comments

Is your feature request related to a problem?

I have some issues with the redis-ha chart. If some pods are destroyed, they don't resynchronize properly, and I have to delete all the pods and wait for all of them to become ready again.

Related helm chart

argo-cd

Describe the solution you'd like

I feel this chart should use the Bitnami-maintained charts, which are now the de facto default for a large part of the community.

See https://artifacthub.io/packages/helm/bitnami/redis and https://artifacthub.io/packages/helm/bitnami/redis-cluster
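
For illustration, the dependency swap could look roughly like this in the chart's dependency declaration (a sketch only: the version constraint is a placeholder and the condition key is an assumption):

dependencies:
  - name: redis
    version: 16.x.x   # placeholder; pin to a real Bitnami release
    repository: https://charts.bitnami.com/bitnami
    condition: redis.enabled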

Describe alternatives you've considered

No response

Additional context

No response

aslafy-z avatar Jan 31 '22 11:01 aslafy-z

:+1: Also, it does not work on OpenShift out of the box. RoleBindings and a ServiceAccount specific to Redis need to be created first.
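
For example, something along these lines is typically needed beforehand (a sketch; the namespace, service account name, and choice of SCC are assumptions that depend on the chart values and your cluster policy):

$ oc create serviceaccount argocd-redis-ha -n argocd
$ oc adm policy add-scc-to-user anyuid -z argocd-redis-ha -n argocd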

gmoirod avatar Mar 07 '22 14:03 gmoirod

The kustomize manifests in the upstream project use the rendered YAMLs from @dandydeveloper's chart: https://github.com/argoproj/argo-cd/blob/v2.3.3/manifests/ha/base/redis-ha/chart/requirements.yaml

The intent of this Helm repository is to follow the same architecture as the upstream projects (Argo CD, Workflows, etc.). IMHO you should file an issue over there: https://github.com/argoproj/argo-cd/issues/new/choose

mkilchhofer avatar Apr 22 '22 12:04 mkilchhofer

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 22 '22 03:06 github-actions[bot]

No-stale

aslafy-z avatar Jun 22 '22 08:06 aslafy-z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 22 '22 03:08 github-actions[bot]

@mkilchhofer What's the issue here? Redis Cluster and Sentinel-based HA are very different architectures (sharded data with cluster-aware clients vs. a single master with automatic failover), so this change could impact Argo quite a lot.

Feel free to raise the issue in my Redis chart's repository; I'm pretty active and try my best to maintain it. Right now, I'm the only really active maintainer.

DandyDeveloper avatar Aug 22 '22 03:08 DandyDeveloper

Hi @DandyDeveloper,

Context: We at @swisspost tried enabling redis-ha in the Argo CD chart and used it for about 1-2 months on our AWS EKS clusters. We use cluster autoscaling and also upgrade our clusters once a week (new AWS AMI for the workers).

Issue: One problem we saw was that one of the 3 Redis pods became unhappy:

$ kubectl logs argocd-server-6499778d-2n56j
(..)
redis: 2022/03/11 10:35:37 pubsub.go:168: redis: discarding bad PubSub connection: EOF
redis: 2022/03/11 10:35:37 pubsub.go:168: redis: discarding bad PubSub connection: EOF
redis: 2022/03/11 10:35:38 pubsub.go:168: redis: discarding bad PubSub connection: EOF
time="2022-03-11T10:35:38Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: EOF"
redis: 2022/03/11 10:35:38 pubsub.go:168: redis: discarding bad PubSub connection: write tcp 10.116.191.151:46704->172.20.36.205:6379: write: broken pipe
redis: 2022/03/11 10:35:38 pubsub.go:168: redis: discarding bad PubSub connection: write tcp 10.116.191.151:46704->172.20.36.205:6379: write: broken pipe
redis: 2022/03/11 10:35:38 pubsub.go:168: redis: discarding bad PubSub connection: EOF
redis: 2022/03/11 10:35:38 pubsub.go:168: redis: discarding bad PubSub connection: EOF
redis: 2022/03/11 10:35:38 pubsub.go:168: redis: discarding bad PubSub connection: EOF
redis: 2022/03/11 10:35:38 pubsub.go:168: redis: discarding bad PubSub connection: EOF

The Redis logs of at least one replica were full of:

kubectl logs argocd-redis-ha-server-1 -c redis
(..)
1:S 11 Mar 2022 07:09:31.312 * Non blocking connect for SYNC fired the event.
1:S 11 Mar 2022 07:09:31.312 * Master replied to PING, replication can continue...
1:S 11 Mar 2022 07:09:31.313 * Partial resynchronization not possible (no cached master)
1:S 11 Mar 2022 07:09:31.313 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 11 Mar 2022 07:09:32.324 * Connecting to MASTER 172.20.145.11:6379
1:S 11 Mar 2022 07:09:32.324 * MASTER <-> REPLICA sync started
1:S 11 Mar 2022 07:09:32.324 * Non blocking connect for SYNC fired the event.
1:S 11 Mar 2022 07:09:32.325 * Master replied to PING, replication can continue...
1:S 11 Mar 2022 07:09:32.325 * Partial resynchronization not possible (no cached master)
1:S 11 Mar 2022 07:09:32.326 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
1:S 11 Mar 2022 07:09:33.328 * Connecting to MASTER 172.20.145.11:6379
1:S 11 Mar 2022 07:09:33.328 * MASTER <-> REPLICA sync started
1:S 11 Mar 2022 07:09:33.329 * Non blocking connect for SYNC fired the event.
(..)

Resolution: We then always fixed it like this:

$ kubectl -n argocd delete po -l app=redis-ha
pod "argocd-redis-ha-server-0" deleted
pod "argocd-redis-ha-server-1" deleted
pod "argocd-redis-ha-server-2" deleted

After 2 months of annoying Redis issues we switched back to single-replica Redis. Since then we have not faced a single Redis-related issue.
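
For reference, the switch back looked roughly like this in our values (a sketch; the key names reflect the chart versions we were on and may differ):

redis-ha:
  enabled: false
redis:
  enabled: true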

mkilchhofer avatar Sep 29 '22 12:09 mkilchhofer

@mkilchhofer How long ago was this?

We had a split-brain scenario that was the result of a bad Sentinel election. It was permanently resolved a while back by introducing a pod that checks for this and explicitly resolves split-brain situations like the one above.

This is a surface-level assumption; I'd need more logs from the elected master / cluster state to provide more context.
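
For anyone reproducing this, the kind of state I'd want can usually be captured with something like the following (a sketch; the sentinel container name and the master group name are assumptions and may differ per install):

$ kubectl -n argocd exec argocd-redis-ha-server-0 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name argocd
$ kubectl -n argocd exec argocd-redis-ha-server-0 -c redis -- redis-cli info replication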

The latest Argo should include the latest Redis chart, so I would highly recommend trying this again.

DandyDeveloper avatar Sep 30 '22 00:09 DandyDeveloper

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 29 '22 02:11 github-actions[bot]

🎛️

pierluigilenoci avatar Nov 29 '22 16:11 pierluigilenoci

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 30 '23 02:01 github-actions[bot]

Why closed???

pierluigilenoci avatar Feb 06 '23 14:02 pierluigilenoci

@mkilchhofer ?

pierluigilenoci avatar Feb 06 '23 14:02 pierluigilenoci

We are seeing this issue too. Not sure why it's been closed.

pdeva avatar Apr 23 '23 03:04 pdeva

I maintain the Redis chart being used; the problem in question should have been resolved long ago.

If people are experiencing problems, throw me a link to the issue or describe the issue so I can investigate.

I believe they closed this because my reply indicated things were fixed and there was no follow-up.

DandyDeveloper avatar Apr 23 '23 04:04 DandyDeveloper