kubernetes-neo4j
Pod is reported as healthy after Neo4j crashes.
This issue is related to #2. After one of the Neo4j services crashes, it is still reported as healthy and keeps receiving traffic, and is therefore unable to fulfil requests.
I tried adding a readinessProbe, but this kills cluster discovery because the pod never registers with the cluster:
containers:
- name: neo4j
  image: "neo4j:3.3.0-enterprise"
  imagePullPolicy: "IfNotPresent"
  readinessProbe:
    initialDelaySeconds: 600
    httpGet:
      path: /db/manage/server/core/available
      port: 7474
What is the recommended way to take crashed pods out of rotation, or to have them restarted?
@nonken Yeah, I had exactly the issue you describe with the readinessProbe, which is why there isn't one in the template. I'll look into this and see if I can come up with an automated way to have the pods restarted.
You could do a kubectl delete pod <pod-name> in the meantime.
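For example, a rough sketch (the pod name and label selector here are assumptions about how the template names things):

# Delete the broken core member; the StatefulSet controller recreates it
kubectl delete pod neo4j-core-2
# Watch the replacement come back up (label selector assumed)
kubectl get pods -l app=neo4j -w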
@nonken I tried putting in a livenessProbe:
livenessProbe:
  initialDelaySeconds: 60
  httpGet:
    path: /db/manage/server/core/available
    port: 7474
and then simulated one of the pods never joining by putting this code in the command section:
if [ `hostname -f` = "neo4j-core-2.neo4j.default.svc.cluster.local" ]
then
  export NEO4J_causal__clustering_initial__discovery__members="foo:5000"
fi
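For context, this is roughly where that snippet sits in the container spec. The surrounding command is an illustrative sketch, not the template's exact script:

containers:
- name: neo4j
  image: "neo4j:3.3.0-enterprise"
  command:
  - "/bin/bash"
  - "-c"
  - |
    # break discovery for one member only (illustrative wrapper script)
    if [ `hostname -f` = "neo4j-core-2.neo4j.default.svc.cluster.local" ]
    then
      export NEO4J_causal__clustering_initial__discovery__members="foo:5000"
    fi
    # hand over to the image's normal entrypoint (path assumed from the official image)
    exec /docker-entrypoint.sh "neo4j"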
And it seems to have kicked in:
$ kubectl logs neo4j-core-2 --previous
Starting Neo4j.
2018-01-08 13:39:13.765+0000 INFO ======== Neo4j 3.3.0 ========
2018-01-08 13:39:13.901+0000 INFO Starting...
2018-01-08 13:39:16.325+0000 INFO Bolt enabled on 0.0.0.0:7687.
2018-01-08 13:39:16.336+0000 INFO Initiating metrics...
2018-01-08 13:39:16.617+0000 INFO Resolved initial host 'foo:5000' to []
2018-01-08 13:39:16.651+0000 INFO My connection info: [
Discovery: listen=0.0.0.0:5000, advertised=neo4j-core-2.neo4j.default.svc.cluster.local:5000,
Transaction: listen=0.0.0.0:6000, advertised=neo4j-core-2.neo4j.default.svc.cluster.local:6000,
Raft: listen=0.0.0.0:7000, advertised=neo4j-core-2.neo4j.default.svc.cluster.local:7000,
Client Connector Addresses: bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687,http://neo4j-core-2.neo4j.default.svc.cluster.local:7474,https://neo4j-core-2.neo4j.default.svc.cluster.local:7473
]
2018-01-08 13:39:16.651+0000 INFO Discovering cluster with initial members: [foo:5000]
2018-01-08 13:39:16.652+0000 INFO Attempting to connect to the other cluster members before continuing...
2018-01-08 13:40:37.426+0000 INFO Neo4j Server shutdown initiated by request
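One way to confirm the restarts from the Kubernetes side (the label selector here is an assumption):

# RESTARTS should keep climbing for the member that never joins
kubectl get pods -l app=neo4j -w
# The pod's events should show the liveness probe failing and the container being killed
kubectl describe pod neo4j-core-2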
Could you see if that works for your case? If so I can commit it to the repo.
@mneedham awesome, I will confirm whether this is working asap.
Regarding your suggestion on removing the pod: I have a load balancer sitting in front, which a service talks to. What happened is that requests would just fail whenever the LB hit the failed pod, because it was still reported as healthy. Deleting the pod would fix the issue but wouldn't really work in prod :)
@mneedham this took me a little longer. I can confirm that using the livenessProbe sort of works.
Now to the actual behaviour. My understanding is that the readinessProbe controls when traffic starts being routed to the pod (once it returns successfully), while the livenessProbe simply restarts the container if it is unhealthy. So for running Neo4j on Kubernetes in production we still want a readinessProbe. It seems like we have a chicken/egg situation here, as the service is only ready once registered in the cluster, but it can only register in the cluster when ready.
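To make the distinction concrete, a minimal sketch of how the two probes would sit side by side in the container spec, built from the snippets above (the readinessProbe below is exactly the one that currently blocks cluster formation):

livenessProbe:            # restarts the container when the endpoint stops responding
  initialDelaySeconds: 60
  httpGet:
    path: /db/manage/server/core/available
    port: 7474
readinessProbe:           # would gate traffic from the Service, but today breaks bootstrap
  initialDelaySeconds: 600
  httpGet:
    path: /db/manage/server/core/available
    port: 7474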
Yep! The problem with the readinessProbe is that we really need it to wait on 7474 being available, but if we do that then the forming of the StatefulSet will stall on the first pod it tries to spin up, because that pod will never be available on 7474 until a cluster (i.e. > 1 server) has been formed.
I've tried searching for a solution to this a few times but haven't found anything. If you have any better luck, let me know.