mehdb
Can't get keys from leader
Hello Michael,
I have just followed this example of yours to learn about StatefulSets in OpenShift, and I found that the follower pods can't get keys from the leader. I think the cause is that the hostname they construct does not match what DNS is able to resolve: mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es instead of mehdb-0.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local.
Here is the output of my test:
ramon@bionic-beaver:~/OpenShift $ oc get sts
NAME      DESIRED   CURRENT   AGE
mehdb     3         3         34m
ramon@bionic-beaver:~/OpenShift $ oc scale sts mehdb --replicas=4
statefulset "mehdb" scaled
ramon@bionic-beaver:~/OpenShift $ oc get pods
NAME      READY     STATUS    RESTARTS   AGE
mehdb-0   1/1       Running   0          36m
mehdb-1   1/1       Running   0          35m
mehdb-2   1/1       Running   0          33m
mehdb-3   1/1       Running   0          1m
ramon@bionic-beaver:~/OpenShift $ oc logs mehdb-3
2019/01/29 11:49:52 mehdb serving from mehdb-3:9876 using /mehdbdata as the data directory
2019/01/29 11:49:52 I am a follower shard, accepting READS
2019/01/29 11:50:02 Checking for new data from leader
2019/01/29 11:50:02 Can't get keys from leader due to Get http://mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es:9876/keys: dial tcp: lookup mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es on 10.64.9.9:53: no such host
2019/01/29 11:50:12 Checking for new data from leader
2019/01/29 11:50:12 Can't get keys from leader due to Get http://mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es:9876/keys: dial tcp: lookup mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es on 10.64.9.9:53: no such host
ramon@bionic-beaver:~/OpenShift $ oc run -i -t --rm dnscheck --restart=Never --image=quay.io/mhausenblas/jump:0.2 -- nslookup mehdb
If you don't see a command prompt, try pressing enter.
Name:      mehdb
Address 1: 10.94.107.130 mehdb-2.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local
Address 2: 10.94.112.4 mehdb-1.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local
Address 3: 10.94.21.232 mehdb-0.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local
Address 4: 10.94.9.228 mehdb-3.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local
I wonder if it would be easy for you to explain the root cause of this error and suggest how it can be fixed.
Thank you in advance, Ramon Cisternas
Thanks for raising this, @ramoncisternas … nothing that immediately comes to mind but could very well be a bug in my code. Will have a look ASAP.
Hi Michael, I'm getting almost the same error as reported above: "Can't get keys from leader due to Get http://mehdb-0.default:9876/keys: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)".
I think this is either a DNS issue or a problem with the way you construct the leader's URL. The leader can't be resolved as "mehdb-0.default", but it can be resolved as "mehdb-0.mehdb.default", so I think the right way to construct its URL is pod_name.service.namespace.
So I would change this line of code
url := "http://" + leaderShard + "." + ns + ":" + port + "/keys"
to
url := "http://" + leaderShard + "." + "THE SERVICE NAME FROM YAML" + "." + ns + ":" + port + "/keys"
Could you please verify the code at your end?
Thanks for the work you have done so far, it helped me to demo the StatefulSet behaviour (apart from the followers not being able to get keys from the leader ;-) )
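The pod_name.service.namespace pattern suggested above can be sketched as a small Go helper. This is only an illustration, not the code in main.go: the function name leaderURL and its parameters are hypothetical, and it assumes the StatefulSet's governing service is called "mehdb" as in this thread.

```go
package main

import (
	"fmt"
	"net"
)

// leaderURL builds the URL of the leader's /keys endpoint using the
// pod.service.namespace hostname pattern discussed in this thread.
// svc is assumed to be the service governing the StatefulSet ("mehdb" here).
// The helper name and its parameters are hypothetical, not taken from main.go.
func leaderURL(pod, svc, ns, port string) string {
	host := pod + "." + svc + "." + ns
	// JoinHostPort adds the ":" separator between host and port.
	return "http://" + net.JoinHostPort(host, port) + "/keys"
}

func main() {
	fmt.Println(leaderURL("mehdb-0", "mehdb", "default", "9876"))
	// prints: http://mehdb-0.mehdb.default:9876/keys
}
```

Passing the service name as a parameter, rather than hard-coding it, would keep the helper usable if the service in the YAML is ever renamed.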
I have the aforementioned fix in place, i.e. in main.go:
- url := "http://" + leaderShard + "." + ns + ":" + port + "/keys"
+ url := "http://" + leaderShard + "." + "mehdb" + "." + ns + ":" + port + "/keys"
However, I'm hitting another issue (open /mehdbdata/test/content: no such file or directory):
oc run -i -t --rm jumpod --restart=Never --image=quay.io/mhausenblas/jump:0.2 -- sh
If you don't see a command prompt, try pressing enter.
~ $ echo "test data" > /tmp/test
~ $ curl -sL -XPUT -T /tmp/test mehdb:9876/set/test
open /mehdbdata/test/content: no such file or directory
This looks like a permission issue:
oc exec -it mehdb-0 -- /bin/bash
bash-4.4$ ls -alh /mehdbdata/
total 12K
drwxr-xr-x. 2 99 99 4.0K Nov 15 05:24 .
drwxr-xr-x. 1 root root 4.0K Nov 15 13:34 ..
bash-4.4$ touch /mehdbdata/test
touch: cannot touch '/mehdbdata/test': Permission denied
bash-4.4$ ls -lZ /mehdbdata/ -d
drwxr-xr-x. 2 99 99 system_u:object_r:nfs_t:s0 4096 Nov 15 05:24 /mehdbdata/
bash-4.4$ id 2
uid=2(daemon) gid=2(daemon) groups=2(daemon)
Thanks @jeffhoek! I'm a little unsure what you want me to do? I can't reproduce it.
I was able to get it running on OpenShift 3.11, with the following steps:
oc create sa mehdb
oc adm policy add-scc-to-user privileged -z mehdb
Then, in app.yaml, add the following to the spec:
serviceAccountName: mehdb