nats-operator
NATS Cluster Pods Gone
So this happened a couple of days ago in one of our environments. After a while, all of the Pods in the NATS Operator-controlled NATS Cluster were completely gone, while the NATS Operator itself was running nominally without errors.
Setup
- Kubernetes v1.12.7-gke.25
- NATS Operator v0.4.4
- NATS Server v1.4.1
- NATS Streaming v0.6.0
Events
| Time | Event |
|---|---|
| 06:10:42 | `nats-cluster-1` and `nats-cluster-3` lose connection with `nats-cluster-2` (10.44.0.73) with error `connect: no route to host` |
| 06:10:46 | nats-operator realises that `nats-cluster-2` is not working correctly: `deleting pod "apps-test/nats-cluster-2" in terminal phase "Failed"` |
| | Lots and lots of `no route to host` messages |
| 09:44:17 | `nats-cluster-1` loses connection with `nats-cluster-3` (10.44.5.55) with error `connect: no route to host` |
| 09:57:38 | Last log statement from `nats-cluster-1`: `[ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.apps-test.svc on 10.111.0.10:53: no such host` |
| 09:57:39 | nats-cluster is completely offline since there are no pods |
Observations
There are several observations:
- ~~NATS Server pods are disappearing for no good reason and without any error messages, neither from the pod itself nor from the operator.~~ NATS Server pods were terminated by Chaoskube, and in all previous experiences the NATS Operator has done its job and recreated the missing pods.
- Since the operator does not use a ReplicaSet, it is not possible for us to alert when NATS pods are missing, as there is no Kubernetes reference for how many pods should exist.
- The operator does not export any metrics itself, so there is no way to get status on how it is performing or any errors it is encountering. I have filed that as a separate issue in #207.
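As a workaround for the missing ReplicaSet reference, the pod count can be checked directly. A minimal sketch of a Prometheus alerting rule using kube-state-metrics; the namespace, pod-name pattern, and expected cluster size of 3 are assumptions about this setup, since no Kubernetes object exposes the desired count:

```yaml
# Sketch only: the expected size (3) must be hard-coded here because
# the operator provides no object carrying a desired replica count.
groups:
  - name: nats-cluster
    rules:
      - alert: NatsClusterPodsMissing
        expr: |
          count(kube_pod_status_phase{namespace="apps-test", phase="Running", pod=~"nats-cluster-.*"} == 1) < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "NATS cluster has fewer running pods than expected"
```

The obvious drawback is that the expected size lives in the alert rule rather than in the cluster definition, so the two can drift apart.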
Thank you for filing the issue - this is good information. We'll take a look ASAP. At first glance it looks like the k8s NATS service (network) failed and it wasn't handled appropriately.
CC @wallyqs @variadico
Thanks for the report, sorry for the inconvenience, we're looking at this and into moving to using statefulsets internally as well within the operator instead of the current controller logic.
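For context on why StatefulSets would help here: a StatefulSet carries a `replicas` field, giving both Kubernetes and external monitoring a desired-count reference that the current pod-based controller logic lacks. A minimal sketch, where the names, image, and ports are assumptions and not the operator's actual manifest:

```yaml
# Sketch of a StatefulSet-managed NATS cluster (names/image assumed).
# spec.replicas is the desired pod count Kubernetes will reconcile to.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nats-cluster
  namespace: apps-test
spec:
  serviceName: nats-cluster-mgmt
  replicas: 3
  selector:
    matchLabels:
      app: nats
  template:
    metadata:
      labels:
        app: nats
    spec:
      containers:
        - name: nats
          image: nats:1.4.1
          ports:
            - containerPort: 4222  # client connections
            - containerPort: 6222  # cluster routes
```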
First off, we are extremely happy with NATS – so thank you for the hard work and dedication! ❤️ Secondly, we have determined that the reason the NATS Server Pods were shutting down was that they were terminated by Chaoskube, a process that kills pods randomly. In all previous instances (and we have had it running for a month) the NATS Operator has done its job and re-created the missing pods – but not in this instance.