
NATS Cluster Pods Gone

Starefossen opened this issue 4 years ago • 3 comments

So this happened a couple of days ago in one of our environments. After a while, all of the Pods in the NATS Operator-controlled NATS Cluster were completely gone, while the NATS Operator itself was running nominally without errors.

Setup

  • Kubernetes v1.12.7-gke.25
  • NATS Operator v0.4.4
  • NATS Server v1.4.1
  • NATS Streaming v0.6.0

Events

| Time | Event |
| --- | --- |
| 06:10:42 | nats-cluster-1 and nats-cluster-3 lose their connection to nats-cluster-2 (10.44.0.73) with error `connect: no route to host` |
| 06:10:46 | nats-operator realises that nats-cluster-2 is not working correctly: `deleting pod "apps-test/nats-cluster-2" in terminal phase "Failed"` |
| | Lots and lots of `no route to host` messages |
| 09:44:17 | nats-cluster-1 loses its connection to nats-cluster-3 (10.44.5.55) with error `connect: no route to host` |
| 09:57:38 | Last log statement from nats-cluster-1: `[ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.apps-test.svc on 10.111.0.10:53: no such host` |
| 09:57:39 | nats-cluster is completely offline since there are no pods left |
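
For context, here is a minimal sketch of how I read the reconcile step behind the 06:10:46 log line: pods in a terminal phase get deleted so a replacement can be created on the next pass. This is illustrative only, not the operator's actual code; the `nats_cluster` label selector and the function name are assumptions on my part.

```go
package operatorsketch

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteTerminalPods removes NATS pods stuck in a terminal phase, producing
// log lines like the one seen at 06:10:46. The operator must then notice the
// shortfall and create replacements on a later pass.
func deleteTerminalPods(ctx context.Context, client kubernetes.Interface, namespace, cluster string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		// Label assumed to select this cluster's pods.
		LabelSelector: "nats_cluster=" + cluster,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded {
			log.Printf("deleting pod %q in terminal phase %q", namespace+"/"+pod.Name, pod.Status.Phase)
			if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```

The deletion half of this loop clearly ran; it is the replacement half that never happened in our case.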

Observations

There are several observations:

  • ~~NATS Server pods are disappearing for no good reason and without any error messages, neither from the pod itself nor from the operator.~~ NATS Server pods were terminated by Chaoskube, and in all previous experiences the NATS Operator has done its job of recreating the missing pods.
  • Since the operator does not use a ReplicaSet, we cannot alert when NATS pods are missing: there is no Kubernetes object that declares how many pods should exist (see the sketch below this list for a possible workaround).
  • The operator does not export any metrics itself, so there is no way to see how it is performing or what errors it is encountering. I have filed that as a separate issue in #207.
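
As a stopgap for the first point, a small watchdog can count the cluster's pods by label and compare against the expected size. This is a rough sketch assuming client-go and the `nats_cluster` pod label; the namespace and expected size are hard-coded placeholders, not values read from the NatsCluster spec.

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	const expected = 3 // size declared in the NatsCluster spec (placeholder)

	// Select the cluster's pods by label, since no ReplicaSet owns them.
	pods, err := client.CoreV1().Pods("apps-test").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "nats_cluster=nats-cluster",
	})
	if err != nil {
		log.Fatal(err)
	}

	running := 0
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	if running < expected {
		// A real watchdog would fire an alert here instead of printing.
		fmt.Printf("ALERT: only %d/%d NATS pods running\n", running, expected)
	}
}
```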

Starefossen avatar Aug 29 '19 13:08 Starefossen

Thank you for filing the issue - this is good information. We’ll take a look ASAP. At first glance it looks like the k8s NATS service (network) failed and the failure wasn't handled appropriately.

CC @wallyqs @variadico

ColinSullivan1 avatar Aug 29 '19 15:08 ColinSullivan1

Thanks for the report, and sorry for the inconvenience. We're looking into this, and also into moving the operator to use StatefulSets internally instead of the current controller logic.
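
For illustration, a rough sketch of what that could look like, assuming the operator were to build a StatefulSet instead of managing pods directly; the labels, image tag, and service name here follow the conventions visible in this report and are not the final implementation:

```go
package natscluster

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// natsStatefulSet builds a StatefulSet for a NATS cluster of the given size.
func natsStatefulSet(name, namespace string, size int32) *appsv1.StatefulSet {
	labels := map[string]string{"app": "nats", "nats_cluster": name}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: appsv1.StatefulSetSpec{
			Replicas: &size,
			// Headless service for stable per-pod DNS names, e.g.
			// nats-cluster-2.nats-cluster-mgmt.apps-test.svc.
			ServiceName: name + "-mgmt",
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "nats",
						Image: "nats:1.4.1",
						Ports: []corev1.ContainerPort{
							{Name: "client", ContainerPort: 4222},
							{Name: "cluster", ContainerPort: 6222},
						},
					}},
				},
			},
		},
	}
}
```

With a StatefulSet, the built-in controller (not the operator) is responsible for keeping `spec.replicas` pods alive, and the replica counts become visible to standard tooling for alerting, which also addresses the monitoring gap raised above.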

wallyqs avatar Aug 29 '19 18:08 wallyqs

First off, we are extremely happy with NATS – so thank you for the hard work and dedication! ❤️ Secondly, we have determined that the NATS Server Pods were shutting down because they were terminated by Chaoskube, a process that kills pods at random. In all previous instances (and we have had it running for a month) the NATS Operator has done its job and re-created the missing pods – but not in this instance.

Starefossen avatar Aug 30 '19 07:08 Starefossen