mehdb

Can't get keys from leader

Open ramoncisternas opened this issue 6 years ago • 5 comments

Hello Michael,

I have just followed this example of yours to learn about stateful sets in OpenShift, and I found that the follower pods can't get keys from the leader. I think the cause is that the hostname is constructed in a form DNS can't resolve: mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es instead of mehdb-0.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local.

Here is the output of my test:

ramon@bionic-beaver:~/OpenShift $ oc get sts
NAME DESIRED CURRENT AGE
mehdb 3 3 34m

ramon@bionic-beaver:~/OpenShift $ oc scale sts mehdb --replicas=4
statefulset "mehdb" scaled

ramon@bionic-beaver:~/OpenShift $ oc get pods
NAME READY STATUS RESTARTS AGE
mehdb-0 1/1 Running 0 36m
mehdb-1 1/1 Running 0 35m
mehdb-2 1/1 Running 0 33m
mehdb-3 1/1 Running 0 1m

ramon@bionic-beaver:~/OpenShift $ oc logs mehdb-3
2019/01/29 11:49:52 mehdb serving from mehdb-3:9876 using /mehdbdata as the data directory
2019/01/29 11:49:52 I am a follower shard, accepting READS
2019/01/29 11:50:02 Checking for new data from leader
2019/01/29 11:50:02 Can't get keys from leader due to Get http://mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es:9876/keys: dial tcp: lookup mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es on 10.64.9.9:53: no such host
2019/01/29 11:50:12 Checking for new data from leader
2019/01/29 11:50:12 Can't get keys from leader due to Get http://mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es:9876/keys: dial tcp: lookup mehdb-0.axa-partners-chatbot-hogar-preprod-axa-services-es on 10.64.9.9:53: no such host

ramon@bionic-beaver:~/OpenShift $ oc run -i -t --rm dnscheck --restart=Never --image=quay.io/mhausenblas/jump:0.2 -- nslookup mehdb
If you don't see a command prompt, try pressing enter.
Name: mehdb
Address 1: 10.94.107.130 mehdb-2.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local
Address 2: 10.94.112.4 mehdb-1.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local
Address 3: 10.94.21.232 mehdb-0.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local
Address 4: 10.94.9.228 mehdb-3.mehdb.axa-partners-chatbot-hogar-preprod-axa-services-es.svc.cluster.local

Could you explain the root cause of this error and suggest how it can be fixed?

Thank you in advance, Ramon Cisternas

ramoncisternas Jan 29 '19 12:01

Thanks for raising this, @ramoncisternas … nothing that immediately comes to mind but could very well be a bug in my code. Will have a look ASAP.

mhausenblas Jan 30 '19 06:01

Hi Michael, I'm getting almost the same error as reported above: "Can't get keys from leader due to Get http://mehdb-0.default:9876/keys: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)".

I think this is either a DNS issue or a problem with the way you construct the leader's URL. The leader can't be resolved as "mehdb-0.default", but it does resolve as "mehdb-0.mehdb.default", so I think the right way to construct its hostname is pod_name.service.namespace.

So I would change this line of code

url := "http://" + leaderShard + "." + ns + ":" + port + "/keys"

to

url := "http://" + leaderShard + "." + "THE SERVICE NAME FROM YAML" + "." + ns + ":" + port + "/keys"
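
To illustrate the pod_name.service.namespace form, here is a minimal Go sketch. The function name leaderKeysURL and the service parameter are hypothetical and not part of mehdb's code; the values passed in main ("mehdb-0", "mehdb", "default", "9876") are taken from the names and port seen in this thread.

```go
package main

import "fmt"

// leaderKeysURL is a hypothetical helper (not from mehdb's main.go).
// It builds the URL a follower would poll, using the
// pod_name.service.namespace hostname form.
func leaderKeysURL(leaderShard, service, ns, port string) string {
	host := leaderShard + "." + service + "." + ns
	return "http://" + host + ":" + port + "/keys"
}

func main() {
	// Assumed values: headless service "mehdb" in namespace "default".
	fmt.Println(leaderKeysURL("mehdb-0", "mehdb", "default", "9876"))
}
```

Passing the service name as a parameter, rather than hardcoding it, would keep the hostname in step with whatever the service is called in the YAML.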

Could you please verify the code at your end?

Thanks for the work you have done so far. It helped me demo the StatefulSet behaviour (apart from the followers not being able to get keys from the leader ;-) )

denismaggior8 Sep 25 '19 21:09

I have the aforementioned fix in place, i.e. in main.go:

-	url := "http://" + leaderShard + "." + ns + ":" + port + "/keys"
+	url := "http://" + leaderShard + "." + "mehdb" + "." + ns + ":" + port + "/keys"

However, I'm hitting another issue (open /mehdbdata/test/content: no such file or directory):

oc run -i -t --rm jumpod --restart=Never --image=quay.io/mhausenblas/jump:0.2 -- sh
If you don't see a command prompt, try pressing enter.
~ $ echo "test data" > /tmp/test
~ $ curl -sL -XPUT -T /tmp/test mehdb:9876/set/test
open /mehdbdata/test/content: no such file or directory

This looks like a permission issue:

oc exec -it mehdb-0 -- /bin/bash
bash-4.4$ ls -alh /mehdbdata/
total 12K
drwxr-xr-x. 2   99   99 4.0K Nov 15 05:24 .
drwxr-xr-x. 1 root root 4.0K Nov 15 13:34 ..
bash-4.4$ touch /mehdbdata/test
touch: cannot touch '/mehdbdata/test': Permission denied
bash-4.4$ ls -lZ /mehdbdata/ -d
drwxr-xr-x. 2 99 99 system_u:object_r:nfs_t:s0 4096 Nov 15 05:24 /mehdbdata/
bash-4.4$ id 2
uid=2(daemon) gid=2(daemon) groups=2(daemon)

jeffhoek Nov 15 '19 17:11

Thanks @jeffhoek! I'm a little unsure what you want me to do, since I can't reproduce it.

mhausenblas Nov 17 '19 09:11

I was able to get it running on OpenShift 3.11 with the following steps:

oc create sa mehdb
oc adm policy add-scc-to-user privileged -z mehdb

Then, in app.yaml, add the following to the spec:

      serviceAccountName: mehdb

jeffhoek Nov 18 '19 20:11