[stable/redis-ha] Fix using pod ip instead of service for announce-ip
What this PR does / why we need it:
Which issue this PR fixes
Root problem: Sometimes separate redis-ha installations hijack slaves from each other. This happens when a slave is registered by its pod IP rather than its service IP; since pod IPs are not fixed, the same IP may later be assigned to another (unfortunately, Redis) pod.
A sample config-init container log for a bad pod is shown below (the service CIDR is 10.43.0.0/16, the pod CIDR is 10.42.0.0/16):
Mon May 30 05:25:09 UTC 2022..
Could not connect to Redis at myredis:26379: Try again
Could not connect to Redis at myredis:26379: Try again
Could not connect to Redis at myredis:26379: Try again
Mon May 30 05:25:54 UTC 2022 Did not find redis master ()
Identify announce ip for this pod..
using (myredis-announce-0) or (myredis-server-0)
identified announce (10.42.12.41)
Setting up defaults..
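The announce-IP resolution in the config-init script behaves roughly like the following. This is a simplified sketch of the logic in _configs.tpl, not the exact template; the variable names are illustrative:

```sh
# Simplified sketch of the announce-IP resolution (names are illustrative).
service="${RELEASE_NAME}-redis-announce-${INDEX}"   # per-pod announce Service

# First try to resolve the announce Service...
host=$(getent hosts "${service}" | awk '{ print $1 }')

if [ -z "${host}" ]; then
    # ...and fall back to the pod IP when resolution fails. This fallback is
    # what lets a pod IP (10.42.x.x) end up in the sentinel state instead of
    # the stable Service IP (10.43.x.x).
    host="${POD_IP}"
fi
echo "identified announce (${host})"
```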
Checklist
- [x] DCO signed
- [x] Chart Version bumped
- [x] Title of the PR starts with chart name (e.g. [stable/mychartname])
@DandyDeveloper Can you review this?
@tahajahangir
I'm just reminding myself on the code here but: https://github.com/DandyDeveloper/charts/blob/abb9d3eae00381ec65de3b2e50d809aa412dd175/charts/redis-ha/templates/_configs.tpl#L275-L287
We detect the announce IP here: it resolves the host (i.e. the announce Service address) and only falls back to the pod IP if that resolution fails.
Isn't the bigger concern here why the initial host=$(getent hosts "${service}") is failing on your cluster with multiple releases?
I think I am in agreement with removing the pod resolving logic, but I think this was intentionally put there for a specific use case, so I'm more interested in why this change is necessary in the first place.
Is there anything in the log about the getent on the announce service address failing?
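Something like the following, run from inside one of the affected pods, would help confirm whether resolution of the announce Service itself is flaky (the service name below is just an example):

```sh
# Repeat the lookup to catch intermittent DNS failures; adjust the service name.
for i in $(seq 1 100); do
    getent hosts myredis-announce-0 >/dev/null || echo "lookup $i failed"
    sleep 1
done
```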
Dear @DandyDeveloper
Here are some logs from a cluster running multiple releases of this Helm chart:
stern -n prod-redis-1 --tail 1000 r-1 -c redis | grep -i master | grep 10.43
prod-redis-1-redis-server-1 1:S 12 May 2022 05:56:11.342 * Connecting to MASTER 10.43.98.141:6379
prod-redis-1-redis-server-1 1:S 12 May 2022 05:59:17.175 * Connecting to MASTER 10.43.185.164:6379
kubectl get svc -A | grep -E '10.43.98.141|10.43.185.164'
prod-redis-2 prod-redis-2-redis-announce-1 ClusterIP 10.43.98.141 <none> 6379/TCP,26379/TCP,9121/TCP 106d
prod-redis-1 prod-redis-1-redis-announce-0 ClusterIP 10.43.185.164 <none> 6379/TCP,26379/TCP,9121/TCP 107d
I can confirm that this PR solves our problem.
@DandyDeveloper
Sorry for the late response.
The getent hosts command issues multiple DNS queries to resolve the host (because of the ndots:5 option). These queries may fail because of a busy network or other bugs (like this). The failures are rare (and nothing is logged for them), but the sentinels then keep tracking the wrong IP (the pod IP), and the bad scenario begins.
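For reference, a pod's /etc/resolv.conf typically looks something like this (the exact search domains and nameserver depend on the cluster). With ndots:5, a short name such as myredis-announce-0 is tried against every search domain before being queried as an absolute name, so one getent hosts call fans out into several DNS queries, any of which can time out:

```
# Example only; actual values vary per cluster.
search prod-redis-1.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.43.0.10
options ndots:5
```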
@DandyDeveloper Can you review this?
Catching up on this one, will come back to it later today.
Turns out I had neglected this entirely. Tomorrow morning, first thing. This is top priority for me.
@DandyDeveloper Any updates on this?
@tahajahangir I was going to patch the chart version to fix the merge changes, but it looks like I can't access your fork.
Can you get this fixed and retrigger the CI? I want to make sure this works on the newest k8s versions as well.
Rebased