
NATS Cluster Pods Gone

Starefossen opened this issue 4 years ago • 3 comments

So this happened a couple of days ago in one of our environments. After a while, all of the Pods in the NATS Operator-controlled NATS Cluster were completely gone, while the NATS Operator itself was running nominally without errors.

Setup

  • Kubernetes v1.12.7-gke.25
  • NATS Operator v0.4.4
  • NATS Server v1.4.1
  • NATS Streaming v0.6.0

Events

| Time | Event |
| --- | --- |
| 06:10:42 | nats-cluster-1 and nats-cluster-3 lose their connection to nats-cluster-2 (10.44.0.73) with error `connect: no route to host` |
| 06:10:46 | nats-operator realises that nats-cluster-2 is not working correctly: `deleting pod "apps-test/nats-cluster-2" in terminal phase "Failed"` |
| | Lots and lots of `no route to host` messages |
| 09:44:17 | nats-cluster-1 loses its connection to nats-cluster-3 (10.44.5.55) with error `connect: no route to host` |
| 09:57:38 | Last log statement from nats-cluster-1: `[ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.apps-test.svc on 10.111.0.10:53: no such host` |
| 09:57:39 | nats-cluster is completely offline since there are no pods left |
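
For context, here is a minimal sketch of how I read the reconcile step behind the 06:10:46 log line: pods in a terminal phase get deleted so a replacement can be created on the next pass. This is illustrative only, not the operator's actual code; the `nats_cluster` label selector and the function name are assumptions on my part.

```go
package operatorsketch

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteTerminalPods removes NATS pods stuck in a terminal phase, producing
// log lines like the one seen at 06:10:46. The operator must then notice the
// shortfall and create replacements on a later pass.
func deleteTerminalPods(ctx context.Context, client kubernetes.Interface, namespace, cluster string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		// Label assumed to select this cluster's pods.
		LabelSelector: "nats_cluster=" + cluster,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded {
			log.Printf("deleting pod %q in terminal phase %q", namespace+"/"+pod.Name, pod.Status.Phase)
			if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```

The deletion half of this loop clearly ran; it is the replacement half that never happened in our case.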

Observations

There are several observations:

  • ~~NATS Server pods are disappearing for no good reason and without any error messages, neither from the pod itself nor from the operator.~~ NATS Server pods were terminated by Chaoskube, and in all previous experiences the NATS Operator has done its job of recreating the missing pods.
  • Since the operator does not use a ReplicaSet, we cannot alert when NATS pods are missing: there is no Kubernetes object that declares how many pods should exist (see the sketch below this list for a possible workaround).
  • The operator does not export any metrics itself, so there is no way to see how it is performing or what errors it is encountering. I have filed that as a separate issue in #207.
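
As a stopgap for the first point, a small watchdog can count the cluster's pods by label and compare against the expected size. This is a rough sketch assuming client-go and the `nats_cluster` pod label; the namespace and expected size are hard-coded placeholders, not values read from the NatsCluster spec.

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	const expected = 3 // size declared in the NatsCluster spec (placeholder)

	// Select the cluster's pods by label, since no ReplicaSet owns them.
	pods, err := client.CoreV1().Pods("apps-test").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "nats_cluster=nats-cluster",
	})
	if err != nil {
		log.Fatal(err)
	}

	running := 0
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	if running < expected {
		// A real watchdog would fire an alert here instead of printing.
		fmt.Printf("ALERT: only %d/%d NATS pods running\n", running, expected)
	}
}
```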

Starefossen avatar Aug 29 '19 13:08 Starefossen

Thank you for filing the issue - this is good information. We’ll take a look ASAP. At first glance it looks like the k8s NATS service (network) failed and the failure wasn't handled appropriately.

CC @wallyqs @variadico

ColinSullivan1 avatar Aug 29 '19 15:08 ColinSullivan1

Thanks for the report, and sorry for the inconvenience. We're looking into this, and also into moving the operator to use StatefulSets internally instead of the current controller logic.
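
For illustration, a rough sketch of what that could look like, assuming the operator were to build a StatefulSet instead of managing pods directly; the labels, image tag, and service name here follow the conventions visible in this report and are not the final implementation:

```go
package natscluster

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// natsStatefulSet builds a StatefulSet for a NATS cluster of the given size.
func natsStatefulSet(name, namespace string, size int32) *appsv1.StatefulSet {
	labels := map[string]string{"app": "nats", "nats_cluster": name}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: appsv1.StatefulSetSpec{
			Replicas: &size,
			// Headless service for stable per-pod DNS names, e.g.
			// nats-cluster-2.nats-cluster-mgmt.apps-test.svc.
			ServiceName: name + "-mgmt",
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "nats",
						Image: "nats:1.4.1",
						Ports: []corev1.ContainerPort{
							{Name: "client", ContainerPort: 4222},
							{Name: "cluster", ContainerPort: 6222},
						},
					}},
				},
			},
		},
	}
}
```

With a StatefulSet, the built-in controller (not the operator) is responsible for keeping `spec.replicas` pods alive, and the replica counts become visible to standard tooling for alerting, which also addresses the monitoring gap raised above.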

wallyqs avatar Aug 29 '19 18:08 wallyqs

First off, we are extremely happy with NATS – so thank you for the hard work and dedication! ❤️ Secondly, we have determined that the NATS Server Pods were shutting down because they were terminated by Chaoskube, a process that kills pods at random. In all previous instances (and we have had it running for a month) the NATS Operator has done its job and re-created the missing pods – but not in this instance.

Starefossen avatar Aug 30 '19 07:08 Starefossen