
[stable/redis-ha] Fix using pod ip instead of service for announce-ip

Open tahajahangir opened this pull request 2 years ago • 3 comments

What this PR does / why we need it:

Which issue this PR fixes

Root problem: Sometimes separate redis-ha installations hijack slaves from each other. This happens when a slave is registered by its pod IP rather than its service IP; since pod IPs are not stable, the same IP may later be assigned to another (unfortunately, Redis) pod ....

A sample config-init container log for a bad pod is as follows (service CIDR is 10.43.0.0/16, pod CIDR is 10.42.0.0/16):

  Mon May 30 05:25:09 UTC 2022..
Could not connect to Redis at myredis:26379: Try again
Could not connect to Redis at myredis:26379: Try again
Could not connect to Redis at myredis:26379: Try again
  Mon May 30 05:25:54 UTC 2022 Did not find redis master ()
Identify announce ip for this pod..
  using (myredis-announce-0) or (myredis-server-0)
  identified announce (10.42.12.41)
Setting up defaults..
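
For anyone debugging a similar cluster, a quick way to spot the symptom is to ask Sentinel which address it is tracking. This is only a hedged sketch: the pod and container names are illustrative, and "mymaster" is just the chart's default master group name. An answer in the pod CIDR (10.42.0.0/16 here) instead of the service CIDR (10.43.0.0/16) means the pod-IP fallback was taken:

  # Ask Sentinel for the address it currently tracks for the master.
  kubectl exec myredis-server-0 -c sentinel -- \
    redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
  # A reply like "10.42.12.41" (pod CIDR) instead of a 10.43.x.x
  # service IP shows the fallback path was used.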

Checklist

  • [x] DCO signed
  • [x] Chart Version bumped
  • [x] Title of the PR starts with chart name (e.g. [stable/mychartname])

tahajahangir avatar May 30 '22 14:05 tahajahangir

@DandyDeveloper Can you review this?

tahajahangir avatar Jun 08 '22 14:06 tahajahangir

@tahajahangir

I'm just reminding myself on the code here but: https://github.com/DandyDeveloper/charts/blob/abb9d3eae00381ec65de3b2e50d809aa412dd175/charts/redis-ha/templates/_configs.tpl#L275-L287

We detect the announce IP here: it resolves the host (i.e. the announce Service address) and only falls back to the pod IP if the announce Service fails to resolve.
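
For readers without the template open, the detection flow is roughly the following (a simplified sketch with illustrative variable names, not the verbatim _configs.tpl):

  identify_announce_ip() {
      # Try the per-pod announce Service first (e.g. myredis-announce-0).
      host=$(getent hosts "${service}")
      if [ -n "${host}" ]; then
          ANNOUNCE_IP=$(echo "${host}" | awk '{ print $1 }')
      else
          # Fallback: resolve the pod's own hostname, which yields the
          # (non-stable) pod IP -- the behaviour this PR removes.
          ANNOUNCE_IP=$(getent hosts "$(hostname).${HEADLESS_SVC}" | awk '{ print $1 }')
      fi
  }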

Isn't the bigger concern here why the initial host=$(getent hosts "${service}") is failing on your cluster with multiple releases?

I'm inclined to agree with removing the pod-resolving logic, but I believe it was put there intentionally for a specific use case, so I'm more interested in why this change is necessary in the first place.

Is there anything in the log about the getent on the announce service address failing?

DandyDeveloper avatar Jun 28 '22 01:06 DandyDeveloper

Dear @DandyDeveloper

Here are some logs from a cluster running multiple releases of this Helm chart:

stern -n prod-redis-1 --tail 1000 r-1 -c redis | grep -i master | grep 10.43
prod-redis-1-redis-server-1 1:S 12 May 2022 05:56:11.342 * Connecting to MASTER 10.43.98.141:6379
prod-redis-1-redis-server-1 1:S 12 May 2022 05:59:17.175 * Connecting to MASTER 10.43.185.164:6379

kubectl get svc -A | grep -E '10.43.98.141|10.43.185.164'
prod-redis-2                   prod-redis-2-redis-announce-1                      ClusterIP      10.43.98.141    <none>        6379/TCP,26379/TCP,9121/TCP                                   106d
prod-redis-1                   prod-redis-1-redis-announce-0                      ClusterIP      10.43.185.164   <none>        6379/TCP,26379/TCP,9121/TCP                                   107d
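
In other words, prod-redis-1-redis-server-1 is replicating from 10.43.98.141, which is the announce Service of the separate prod-redis-2 release: exactly the cross-release hijack described above.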

I can confirm that this PR solves our problem.

mhkarimi1383 avatar Jun 29 '22 07:06 mhkarimi1383

@DandyDeveloper Sorry for the late response. The getent hosts command issues multiple DNS queries (because of the ndots:5 option) to resolve the host. These queries may fail because of a busy network or other bugs (like this). The failures are rare (and leave no log), but the sentinels then keep tracking the wrong IP (the pod IP), and the bad scenario begins....
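
To make the failure mode concrete, here is roughly what the resolver sees inside a pod (a hedged illustration; the search domains vary per cluster and namespace):

  # Typical kubelet-generated /etc/resolv.conf:
  #   search prod-redis-1.svc.cluster.local svc.cluster.local cluster.local
  #   options ndots:5
  #
  # "myredis-announce-0" contains fewer than 5 dots, so the resolver walks
  # the whole search list, issuing several queries. If any of them times
  # out on a busy network, the lookup as a whole can fail even though the
  # Service exists:
  getent hosts myredis-announce-0 || echo "transient DNS failure"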

tahajahangir avatar Aug 09 '22 14:08 tahajahangir

@DandyDeveloper Can you review this?

tahajahangir avatar Aug 16 '22 15:08 tahajahangir

Catching up on this one, will come back to it later today.

DandyDeveloper avatar Aug 17 '22 01:08 DandyDeveloper

Turns out I had neglected this entirely. Tomorrow morning, first thing. This is top priority for me.

DandyDeveloper avatar Aug 22 '22 11:08 DandyDeveloper

@DandyDeveloper Any updates on this?

mhkarimi1383 avatar Aug 24 '22 04:08 mhkarimi1383

@tahajahangir I was going to patch the chart version to fix the merge changes, but it looks like I can't access your fork.

Can you get this fixed and retrigger the CI? I want to make sure this works on the newest k8s versions as well.

DandyDeveloper avatar Aug 26 '22 01:08 DandyDeveloper

Rebased

tahajahangir avatar Aug 26 '22 03:08 tahajahangir