
NATS cluster won't elect a new leader when streams are created with different ReplicaCount settings

Open schroding3rscat opened this issue 3 years ago • 2 comments

Defect

nats-server -DV output:

# nats-server -DV
[12759] 2022/02/05 18:22:35.815756 [INF] Starting nats-server
[12759] 2022/02/05 18:22:35.815921 [INF]   Version:  2.6.6
[12759] 2022/02/05 18:22:35.815971 [INF]   Git:      [878afad]
[12759] 2022/02/05 18:22:35.815994 [DBG]   Go build: go1.16.10
[12759] 2022/02/05 18:22:35.816005 [INF]   Name:     NDAVV55EH2IBZA4OJRRNTFKJ2UVJFUXBY5FLRODS4XATB6GL22K5LV7I
[12759] 2022/02/05 18:22:35.816027 [INF]   ID:       NDAVV55EH2IBZA4OJRRNTFKJ2UVJFUXBY5FLRODS4XATB6GL22K5LV7I
[12759] 2022/02/05 18:22:35.816093 [DBG] Created system account: "$SYS"
[12759] 2022/02/05 18:22:35.818217 [FTL] Error listening on port: 0.0.0.0:4222, "listen tcp 0.0.0.0:4222: bind: address already in use"

Versions of nats-server and affected client libraries used:

NATS Server version: 2.6.6; Go client library: v1.13.0

OS/Container environment:

VM with Debian 10.

Steps or code to reproduce the issue:

We had an interesting case in our NATS production cluster. Not sure how to describe this in steps, but I'll try.

  1. Our DevOps engineer deployed a production JetStream cluster for our project: 5 machines, one leader elected. Everything was OK and we started using it.
  2. We created three streams: orders consuming orders.*, stocks consuming stocks.*, and so on. All of them were created from the Go library with the AddStream() method. The first mistake was made here: we called the method with defaults, and this produced streams bound to a single replica (see the sketch after the report below):
╭────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                           Stream Report                                            │
├───────────┬─────────┬───────────┬────────────┬─────────┬──────┬─────────┬──────────────────────────┤
│ Stream    │ Storage │ Consumers │ Messages   │ Bytes   │ Lost │ Deleted │ Replicas                 │
├───────────┼─────────┼───────────┼────────────┼─────────┼──────┼─────────┼──────────────────────────┤
│ suppliers │ File    │ 6         │ 107        │ 5.6 KiB │ 0    │ 0       │ nats-js-prd-cl-1*        │
│ stocks    │ File    │ 21        │ 15,541,064 │ 7.1 GiB │ 0    │ 0       │ nats-js-prd-cl-2*        │
│ orders    │ File    │ 29        │ 82,374,815 │ 134 GiB │ 0    │ 0       │ nats-js-prd-cl-2*        │
╰───────────┴─────────┴───────────┴────────────┴─────────┴──────┴─────────┴──────────────────────────╯
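For reference, a minimal sketch of the kind of AddStream() call that produces such single-replica streams; the connection URL and error handling are illustrative, not from the issue:

```go
package main

import "github.com/nats-io/nats.go"

func main() {
	// Connection URL is hypothetical; use your own cluster address.
	nc, err := nats.Connect("nats://nats-js-prd-cl-1:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Replicas is left at its zero value, so the server defaults it to 1:
	// even in a 5-node cluster the stream lives on a single node.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "orders",
		Subjects: []string{"orders.*"},
		Storage:  nats.FileStorage,
	}); err != nil {
		panic(err)
	}
}
```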
  3. All was OK and messages flowed fine. Fast-forward two months: we detected a NATS fault in production, DevOps connected to the cluster, restarted the machines, and it worked again. During this event he discovered that we had created all streams bound to a single replica, and recommended we recreate them with ReplicaCount: 3-5 to actually use the whole cluster we have.
  4. We created three new streams (orders2, stocks2, suppliers2) with the recommended settings, planning to switch our code over to them (see the sketch after the report below).
(report from the faulted cluster)
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                     Stream Report                                                │
├────────────┬─────────┬───────────┬──────────┬─────────┬──────┬─────────┬─────────────────────────────────────────────────────────────────────────────────────────┤
│ Stream     │ Storage │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas                                                                                │
├────────────┼─────────┼───────────┼──────────┼─────────┼──────┼─────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ suppliers2 │ File    │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-js-prd-cl-1.dc1, nats-js-prd-cl-1.dc2*, nats-js-prd-cl-1.dc3, nats-js-prd-cl-2.dc1 │
│ orders     │ File    │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-js-prd-cl-2!                                                                        │
│ orders2    │ File    │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-js-prd-cl-1.dc1, nats-js-prd-cl-1.dc2, nats-js-prd-cl-1.dc3, nats-js-prd-cl-2.dc1* │
│ stocks     │ File    │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-js-prd-cl-2!                                                                        │
│ stocks2    │ File    │ 0         │ 0        │ 0 B     │ 0    │ 0       │ nats-js-prd-cl-1.dc1, nats-js-prd-cl-1.dc2, nats-js-prd-cl-1.dc3, nats-js-prd-cl-2.dc1* │
│ suppliers  │ File    │ 6         │ 103      │ 5.4 KiB │ 0    │ 0       │ nats-js-prd-cl-1*                                                                       │
╰────────────┴─────────┴───────────┴──────────┴─────────┴──────┴─────────┴─────────────────────────────────────────────────────────────────────────────────────────╯
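A sketch of what the recreated streams look like with an explicit replica count, plus a placement check via StreamInfo. The helper name and the orders2.* subject are assumptions for illustration:

```go
package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

// createReplicated creates a replicated stream and prints where the
// cluster placed it.
func createReplicated(js nats.JetStreamContext) error {
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "orders2",
		Subjects: []string{"orders2.*"}, // subject is an assumption
		Storage:  nats.FileStorage,
		Replicas: 3, // spread the stream across three of the five nodes
	}); err != nil {
		return err
	}

	info, err := js.StreamInfo("orders2")
	if err != nil {
		return err
	}
	fmt.Println("leader:", info.Cluster.Leader)
	for _, peer := range info.Cluster.Replicas {
		fmt.Println("replica:", peer.Name, "caught up:", peer.Current)
	}
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // cluster URL goes here
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}
	if err := createReplicated(js); err != nil {
		panic(err)
	}
}
```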
  5. Before we made the switch, one of our devs wrote a small Go script to test the publish/subscribe settings. He pushed a message to a new subject inside the orders stream, and right after that our cluster went down. We waited about 30 minutes before taking any action, hoping it would recover. No luck. After that our DevOps engineer started restarting machines one by one and removing machines from the cluster, but this didn't help. The cluster just wouldn't elect a new leader, and we saw logs like this:
RAFT [lvzWYqj1 - S-R5F-hFmDFHmx] Sending out voteRequest {term:2240 lastTerm:2177 lastIndex:408 candidate:lvzWYqj1 reply:}
10.40.194.3:63692 - rid:2353 - Error flushing: writev tcp 10.20.195.169:5222->10.40.194.3:63692: writev: connect
10.40.194.3:63692 - rid:2353 - Router connection closed: Write Error
10.40.194.3:63698 - rid:2360 - Route connection created
10.40.194.3:5222 - rid:2361 - Router connection closed: Duplicate Route
JetStream cluster stream '$G > stocks2' has NO quorum, stalled.
JetStream cluster stream '$G > orders2' has NO quorum, stalled.

One thing did help the election: removing the node to which all the old streams were exclusively bound (nats-js-prd-cl-2). That way we would lose all our data (not an option), but the cluster worked.

  6. After several hours of investigation, we found out that all the "no quorum" log lines mentioned only the new, empty streams (orders2, etc.). We removed them from the cluster while nats-js-prd-cl-2 restarted and recovered its streams. And voilà! A leader was elected immediately and the cluster worked again:
╭─────────────────────────────────────────────────────────────────────────────────────────────╮
│                                           Stream Report                                     │
├───────────┬─────────┬───────────┬────────────┬─────────┬──────┬─────────┬───────────────────┤
│ Stream    │ Storage │ Consumers │ Messages   │ Bytes   │ Lost │ Deleted │ Replicas          │
├───────────┼─────────┼───────────┼────────────┼─────────┼──────┼─────────┼───────────────────┤
│ suppliers │ File    │ 6         │ 107        │ 5.6 KiB │ 0    │ 0       │ nats-js-prd-cl-1* │
│ stocks    │ File    │ 21        │ 4,130,363  │ 1.5 GiB │ 0    │ 0       │ nats-js-prd-cl-2* │
│ orders    │ File    │ 29        │ 22,732,306 │ 30 GiB  │ 0    │ 0       │ nats-js-prd-cl-2* │
╰───────────┴─────────┴───────────┴────────────┴─────────┴──────┴─────────┴───────────────────╯

Expected result:

The cluster keeps working with all streams, regardless of their configuration.

Actual result:

The cluster fails to elect a leader after a message is published to a new subject in a stream bound to a single node.

schroding3rscat avatar Feb 05 '22 16:02 schroding3rscat

A couple of items: JetStream is moving fast, so we ask that production systems stay as current with releases as they can; we are on 2.7.2.

Also, you can back up and restore a stream and change the replication count in the process, so that could have been an option to convert from R1 to R3. We will allow this operation via a stream update at some point as well.
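For completeness, a hypothetical sketch of that stream-update path once replica changes via update are allowed. On the 2.6.x/2.7.x servers discussed in this issue such an update would be rejected, and the backup/restore route above is the alternative:

```go
package main

import "github.com/nats-io/nats.go"

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Hypothetical: bump the existing R1 stream to three replicas in place.
	// Rejected on the server versions discussed here; newer releases allow it.
	if _, err := js.UpdateStream(&nats.StreamConfig{
		Name:     "orders",
		Subjects: []string{"orders.*"},
		Storage:  nats.FileStorage,
		Replicas: 3,
	}); err != nil {
		panic(err)
	}
}
```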

derekcollison avatar Feb 05 '22 17:02 derekcollison

@schroding3rscat @derekcollison The leader election issue has been fixed in 2.7.2.

I have a 3-node cluster with MQTT in Kubernetes. With 2.7.1 I was getting the following error when losing one node:

[WRN] JetStream cluster stream '$G > $MQTT_sess' has NO quorum, stalled

After upgrading my cluster to 2.7.2, it now elects a leader as expected.

One observation: after scaling the cluster down from 3 nodes to 1, it stops allowing new connections.

imranrazakhan avatar Feb 15 '22 16:02 imranrazakhan