
Re-deploying nats-jetstream fails with `NO quorum, stalled.`

Open duc2h opened this issue 3 years ago • 27 comments

Defect

Make sure that these boxes are checked before submitting your issue -- thank you!

  • [ ] Included nats-server -DV output
  • [ ] Included a [Minimal, Complete, and Verifiable example] (https://stackoverflow.com/help/mcve)

Versions of nats-server and affected client libraries used:

2.6.6-alpine3.14

OS/Container environment:

Steps or code to reproduce the issue:

re-deploy nats cluster

Expected result:

deploy success

Actual result:

2021-12-08 16:45:15.695 ICT[1] 2021/12/08 09:45:15.694914 [WRN] JetStream cluster consumer 'A > syncuserregistration > durable-sync-staff' has NO quorum, stalled.
2021-12-08 16:45:16.904 ICT[1] 2021/12/08 09:45:16.904527 [WRN] JetStream cluster stream 'A > studenteventlogs' has NO quorum, stalled.
2021-12-08 16:45:18.250 ICT[1] 2021/12/08 09:45:18.249982 [WRN] JetStream cluster consumer 'A > eurekastudentevent > durable-eureka-student-event-created' has NO quorum, stalled.
2021-12-08 16:45:18.414 ICT[1] 2021/12/08 09:45:18.414104 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-class' has NO quorum, stalled.
2021-12-08 16:45:18.565 ICT[1] 2021/12/08 09:45:18.564943 [WRN] JetStream cluster stream 'A > syncmasterregistration' has NO quorum, stalled.
2021-12-08 16:45:19.506 ICT[1] 2021/12/08 09:45:19.506590 [WRN] JetStream cluster stream 'A > activitylog' has NO quorum, stalled.
2021-12-08 16:45:19.717 ICT[1] 2021/12/08 09:45:19.717403 [WRN] JetStream cluster consumer 'A > learningobjectives > durable-learning-objectives-created' has NO quorum, stalled.
2021-12-08 16:45:20.278 ICT[1] 2021/12/08 09:45:20.278274 [WRN] JetStream cluster consumer 'A > studenteventlogs > durable-student-event-logs-created' has NO quorum, stalled.
2021-12-08 16:45:20.506 ICT[1] 2021/12/08 09:45:20.505963 [WRN] JetStream cluster consumer 'A > syncuserregistration > durable-log-payload' has NO quorum, stalled.
2021-12-08 16:45:20.687 ICT[1] 2021/12/08 09:45:20.686837 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-academic-year' has NO quorum, stalled.
2021-12-08 16:45:20.805 ICT[1] 2021/12/08 09:45:20.805577 [WRN] JetStream cluster consumer 'A > cloudconvertjobevent > durable-cloud-convert' has NO quorum, stalled.
2021-12-08 16:45:22.363 ICT[1] 2021/12/08 09:45:22.363649 [WRN] JetStream cluster consumer 'A > activitylog > durable-activity-log-created' has NO quorum, stalled.
2021-12-08 16:45:22.601 ICT[1] 2021/12/08 09:45:22.601333 [WRN] JetStream cluster stream 'A > learningobjectives' has NO quorum, stalled.
2021-12-08 16:45:23.257 ICT[1] 2021/12/08 09:45:23.257704 [WRN] JetStream cluster consumer 'A > assignstudyplan > durable-assign-study-plan' has NO quorum, stalled.
2021-12-08 16:45:24.043 ICT[1] 2021/12/08 09:45:24.043701 [WRN] JetStream cluster stream 'A > syncusercourse' has NO quorum, stalled.
2021-12-08 16:45:24.807 ICT[1] 2021/12/08 09:45:24.807032 [WRN] JetStream cluster stream 'A > chatmessage' has NO quorum, stalled.
2021-12-08 16:45:24.989 ICT[1] 2021/12/08 09:45:24.989169 [WRN] JetStream cluster stream 'A > assignstudyplan' has NO quorum, stalled.
2021-12-08 16:45:26.588 ICT[1] 2021/12/08 09:45:26.588135 [WRN] JetStream cluster consumer 'A > studentpackage > durable-student-package' has NO quorum, stalled.
2021-12-08 16:45:28.324 ICT[1] 2021/12/08 09:45:28.323815 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-course-class' has NO quorum, stalled.

duc2h avatar Dec 08 '21 09:12 duc2h

Can you give details about how many servers in your cluster? What is the replication factor of the streams and consumers?

derekcollison avatar Dec 08 '21 22:12 derekcollison

I've run into this issue with a 5-node cluster. I can't reproduce it consistently, but it seemed to be correlated with an inconsistency in the cluster size reported by each server after restarting a node (see #2657).

We were unable to resolve this issue and get the cluster size to report the correct size consistently. Instead, we moved to a 3-node cluster and haven't had issues since.

rh2048 avatar Dec 13 '21 02:12 rh2048

We seem to be having the same issue with the same version.

What can we provide to help debug this?

nvcnvn avatar Dec 20 '21 12:12 nvcnvn

Same problem?

Cluster size: 3
NATS: 2.7.4
Using NATS security with distributed JWTs

Updated to a new version (2.7.4) from 2.6.6. Before upgrading, a nats backup of the streams was performed, but I am now unable to restore the streams. I'm getting this error:

[WRN] JetStream cluster stream 'AD2XXTUQI453QTLRZYHP4O2NGKPUMI6T22MGKKUWADO3IS6W226NQZX7 > <stream>' has NO quorum, stalled

Checking whether the stream exists:

nats -s <server> --creds <credsFile> stream report
Obtaining Stream stats
No Streams defined

If I try to create the same stream again after the failed restore I'm getting this:

nats -s <server> --creds <credsFile> stream create <streamName>
? Subjects to consume <topic>.>
? Storage backend file
? Retention Policy Limits
? Discard Policy Old
? Stream Messages Limit -1
? Message size limit -1
? Maximum message age limit 3M
? Maximum individual message size -1
? Duplicate tracking time window 5m
? Replicas 2
nats: error: could not create Stream: malformed or corrupt message

tommylp avatar Mar 28 '22 09:03 tommylp

Please run your create command with --trace and show the output.

ripienaar avatar Mar 28 '22 10:03 ripienaar

12:20:23 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":2,"duplicate_window":300000000000}

12:20:23 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","error":{"code":500,"err_code":10049,"description":"malformed or corrupt message"}}

nats: error: could not create Stream: malformed or corrupt message

Setting the replica count to 1 lets the stream be created.

tommylp avatar Mar 28 '22 10:03 tommylp

Your subject appears to end in foo.>> but it can only have one >.

Assuming you have a system account, use it and show the output of nats server list and nats server report jsz.

ripienaar avatar Mar 28 '22 10:03 ripienaar

Reran with a single >:

nats -s <server> --creds <credsFile> stream create <stream> --config stream.config --trace
12:25:47 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":2,"duplicate_window":300000000000}

12:25:47 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","error":{"code":500,"err_code":10049,"description":"malformed or corrupt message"}}

nats: error: could not create Stream: malformed or corrupt message

tommylp avatar Mar 28 '22 10:03 tommylp

Content of config

{
  "name": "<stream>",
  "subjects": [
    "<topic>.\u003e"
  ],
  "retention": "limits",
  "max_consumers": -1,
  "max_msgs": -1,
  "max_bytes": -1,
  "max_age": 7776000000000000,
  "max_msg_size": -1,
  "storage": "file",
  "discard": "old",
  "num_replicas": 2,
  "duplicate_window": 300000000000
}

tommylp avatar Mar 28 '22 10:03 tommylp
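The large integers in this config are easier to sanity-check once converted: JetStream's JSON API expresses durations such as max_age and duplicate_window in nanoseconds. The sketch below uses only the Go standard library; the helper names fromNanos and days are made up for illustration and are not part of any NATS package.

```go
package main

import (
	"fmt"
	"time"
)

// fromNanos converts a raw nanosecond count, as it appears in the JSON
// stream config, into a time.Duration (itself a nanosecond count).
// Hypothetical helper, not part of any NATS library.
func fromNanos(ns int64) time.Duration {
	return time.Duration(ns)
}

// days expresses a duration in 24-hour days.
func days(d time.Duration) float64 {
	return d.Hours() / 24
}

func main() {
	maxAge := fromNanos(7776000000000000) // "max_age" from the config above
	dupWindow := fromNanos(300000000000)  // "duplicate_window" from the config above

	fmt.Println(maxAge, days(maxAge)) // 2160h0m0s 90
	fmt.Println(dupWindow)            // 5m0s
}
```

So the 3M entered at the prompt ends up as 90 days, and the duplicate window as 5 minutes.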

server list

+-----------------------------------------------------------------------------------------------------------------------------------+
|                                                          Server Overview                                                          |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
| Name        | Cluster    | IP        | Version | JS  | Conns | Subs | Routes | GWs | Mem    | CPU | Slow | Uptime   | RTT         |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
| nats-core-1 | nats-core  | 0.0.0.0   | 2.7.4   | yes | 0     | 302  | 2      | 0   | 17 MiB | 0.0 | 0    | 27m22s   | 71.283698ms |
| nats-core-0 | nats-core  | 0.0.0.0   | 2.7.4   | yes | 14    | 366  | 2      | 0   | 26 MiB | 0.0 | 0    | 1h59m34s | 71.246344ms |
| nats-core-2 | nats-core  | 0.0.0.0   | 2.7.4   | yes | 5     | 328  | 2      | 0   | 17 MiB | 0.0 | 0    | 29m51s   | 71.193031ms |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
|             | 1 Clusters | 3 Servers |         | 3   | 19    | 996  |        |     | 60 MiB |     | 0    |          |             |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+

+------------------------------------------------------------------------------+
|                               Cluster Overview                               |
+-----------+------------+-------------------+-------------------+-------------+
| Cluster   | Node Count | Outgoing Gateways | Incoming Gateways | Connections |
+-----------+------------+-------------------+-------------------+-------------+
| nats-core | 3          | 0                 | 0                 | 19          |
+-----------+------------+-------------------+-------------------+-------------+
|           | 3          | 0                 | 0                 | 19          |
+-----------+------------+-------------------+-------------------+-------------+

tommylp avatar Mar 28 '22 10:03 tommylp

server report jsz:

+-------------------------------------------------------------------------------------------------------+
|                                           JetStream Summary                                           |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
| Server       | Cluster   | Streams | Consumers | Messages | Bytes | Memory | File | API Req | API Err |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
| nats-core-2* | nats-core | 0       | 0         | 0        | 0 B   | 0 B    | 0 B  | 8       | 7       |
| nats-core-0  | nats-core | 0       | 0         | 0        | 0 B   | 0 B    | 0 B  | 2       | 0       |
| nats-core-1  | nats-core | 0       | 0         | 0        | 0 B   | 0 B    | 0 B  | 6       | 0       |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
|              |           | 0       | 0         | 0        | 0 B   | 0 B    | 0 B  | 16      | 7       |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+

+--------------------------------------------------------+
|              RAFT Meta Group Information               |
+-------------+--------+---------+--------+--------+-----+
| Name        | Leader | Current | Online | Active | Lag |
+-------------+--------+---------+--------+--------+-----+
| nats-core-0 |        | true    | true   | 0.22s  | 0   |
| nats-core-1 |        | true    | true   | 0.22s  | 0   |
| nats-core-2 | yes    | true    | true   | 0.00s  | 0   |
+-------------+--------+---------+--------+--------+-----+

tommylp avatar Mar 28 '22 10:03 tommylp

So if you just change your config to replicas 1, it works? (The config is valid now.)

ripienaar avatar Mar 28 '22 10:03 ripienaar

Yes, no problem.

nats -s <server> --creds <credsFile> stream create <stream> --config stream.config --trace
12:32:41 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":1,"duplicate_window":300000000000}

12:32:41 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","config":{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msgs_per_subject":-1,"max_msg_size":-1,"discard":"old","storage":"file","num_replicas":1,"duplicate_window":300000000000,"sealed":false,"deny_delete":false,"deny_purge":false,"allow_rollup_hdrs":false},"created":"2022-03-28T10:32:41.197592339Z","state":{"messages":0,"bytes":0,"first_seq":0,"first_ts":"0001-01-01T00:00:00Z","last_seq":0,"last_ts":"0001-01-01T00:00:00Z","consumer_count":0},"cluster":{"name":"nats-core","leader":"nats-core-1"},"did_create":true}

Stream <stream> was created

Information for Stream <stream> created 2022-03-28T12:32:41+02:00

Configuration:

             Subjects: <topic>.>
     Acknowledgements: true
            Retention: File - Limits
             Replicas: 1
       Discard Policy: Old
     Duplicate Window: 5m0s
     Maximum Messages: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 90d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


Cluster Information:

                 Name: nats-core
               Leader: nats-core-1

State:

             Messages: 0
                Bytes: 0 B
             FirstSeq: 0
              LastSeq: 0
     Active Consumers: 0

tommylp avatar Mar 28 '22 10:03 tommylp

Unfortunately setting the num_replicas to 1 in the backup.json file did not solve the problem.

nats: error: restore failed: malformed or corrupt message

tommylp avatar Mar 28 '22 10:03 tommylp

Do I also need to update the value inside the base64 configuration? That will probably not work, though, because the checksum value would change.

tommylp avatar Mar 28 '22 10:03 tommylp

How is the checksum created?

{
  "type": "stream",
  "time": "2022-03-28T08:27:07Z",
  "configuration": "<base64String>",
  "checksum": "31423daa92ee............"
}

tommylp avatar Mar 28 '22 10:03 tommylp
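To at least see what is stored in that field, the base64 string can be decoded and inspected. The following is a minimal sketch, assuming the configuration field holds base64-encoded JSON as the question implies. It does not address how the checksum is computed, which this thread does not establish. The type backupMeta and the functions decodeConfig and sampleBackup are hypothetical names, and the sample payload in main is invented for the demo.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// backupMeta mirrors the four fields shown in the backup.json excerpt.
// Hypothetical type for illustration only.
type backupMeta struct {
	Type          string `json:"type"`
	Time          string `json:"time"`
	Configuration string `json:"configuration"`
	Checksum      string `json:"checksum"`
}

// decodeConfig parses the metadata and returns the decoded configuration
// as a generic map. It assumes the field is standard base64 over JSON.
func decodeConfig(raw []byte) (map[string]interface{}, error) {
	var meta backupMeta
	if err := json.Unmarshal(raw, &meta); err != nil {
		return nil, fmt.Errorf("parse backup metadata: %w", err)
	}
	decoded, err := base64.StdEncoding.DecodeString(meta.Configuration)
	if err != nil {
		return nil, fmt.Errorf("decode configuration: %w", err)
	}
	var cfg map[string]interface{}
	if err := json.Unmarshal(decoded, &cfg); err != nil {
		return nil, fmt.Errorf("parse configuration: %w", err)
	}
	return cfg, nil
}

// sampleBackup wraps an inner JSON document in made-up metadata so the
// decoder can be exercised without a real backup file.
func sampleBackup(inner string) []byte {
	meta := backupMeta{
		Type:          "stream",
		Time:          "2022-03-28T08:27:07Z",
		Configuration: base64.StdEncoding.EncodeToString([]byte(inner)),
		Checksum:      "placeholder",
	}
	out, _ := json.Marshal(meta) // marshalling a struct of strings cannot fail
	return out
}

func main() {
	cfg, err := decodeConfig(sampleBackup(`{"name":"demo","num_replicas":2}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg["name"], cfg["num_replicas"])
}
```

This only reads the field. Writing a modified value back would presumably invalidate the checksum, as the comment above suspects.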

I think you can pass --replicas when restoring rather than editing the file.

ripienaar avatar Mar 28 '22 11:03 ripienaar

No --replicas flag that I can see. The nats stream restore command has a --config flag that can take a config file, but that did not work either.

tommylp avatar Mar 28 '22 11:03 tommylp

The problem seems to be related to the JetStream leader: evicting the leader, or killing the leader pod in k8s so that it moves to another instance, lets me create the stream with 2 replicas.

Doing the same for stream restore still does not work.

tommylp avatar Mar 28 '22 12:03 tommylp

Related issue: https://github.com/nats-io/nats-server/issues/2845

tommylp avatar Mar 29 '22 06:03 tommylp

Also experiencing this problem on nats v2.7.4 with:

  • 3-node cluster
  • 3 replicas per stream
  • 10k subjects
  • 10k push consumers (one per subject), spread among 10 to 20 streams (the exact number doesn't matter)

After a cluster restart, a lot of consumers (but not all) have no quorum and become stalled.

ajax-lizogubenko-s avatar Apr 08 '22 07:04 ajax-lizogubenko-s

> The problem seems to be related to the JetStream leader: evicting the leader, or killing the leader pod in k8s so that it moves to another instance, lets me create the stream with 2 replicas.
>
> Doing the same for stream restore still does not work.

We're facing the same error when we create a new node pool and evict nats streaming.

What is the procedure to work around this? When this issue happens, our application cannot connect to nats. The only fix we know is to delete the stream and re-create it, which causes some data loss.

nvcnvn avatar Apr 08 '22 12:04 nvcnvn

@sergiilizo and @nvcnvn we would most likely need to jump on a Zoom call to diagnose more thoroughly the situation.

derekcollison avatar Apr 08 '22 13:04 derekcollison

> @sergiilizo and @nvcnvn we would most likely need to jump on a Zoom call to diagnose more thoroughly the situation.

Hi @derekcollison thanks for your great support, how should we arrange this?

nvcnvn avatar Apr 08 '22 14:04 nvcnvn

Shoot me an email, [email protected].

derekcollison avatar Apr 08 '22 14:04 derekcollison

@derekcollison

I received the same error for consumer stalled.

JetStream cluster consumer '$G > configuration > admin_CreateAdminUserCommand_firebase_CreateAdminUser' has NO quorum, stalled.
Healthcheck failed: "JetStream consumer '$G > configuration > admin_AdminUserCreatedEvent_firebase_AdminUserCreated' is not current"

This is a 5-node cluster and the stream was set to 3 replicas.

Version 2.8.2 - k8s - attached pvc

I changed the replicas to 5 and the cluster became stable again.

Can you tell me how quorum is calculated for a consumer with a replica of 3 in a 5 node cluster? The only doc for quorum I could find was https://docs.nats.io/running-a-nats-service/configuration/clustering/jetstream_clustering#the-quorum.

If quorum is calculated the same way (half the nodes plus one), then I assume quorum won't be reached if a node that held the consumer's data (replicas 3) drops out of the 5-node cluster. Is this valid, or am I off base?

cchatfield avatar May 17 '22 22:05 cchatfield

2.8.3 should be released tomorrow which hopefully helps out here.

Quorum calculation is N/2+1. So for R3 it's 2, for R5 it's 3.

derekcollison avatar May 17 '22 22:05 derekcollison
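The N/2+1 rule can be worked through for a few replica counts. This is a small sketch that reads N as the replica count of the stream or consumer (R3, R5), as the reply's examples suggest, and assumes the division rounds down. The function names quorum and tolerated are illustrative and not taken from nats-server.

```go
package main

import "fmt"

// quorum returns how many members of an n-replica group must be
// reachable for the group to make progress: floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1 // integer division rounds down
}

// tolerated returns how many members of an n-replica group can be
// offline while quorum still holds.
func tolerated(n int) int {
	return n - quorum(n)
}

func main() {
	for _, r := range []int{1, 2, 3, 5} {
		fmt.Printf("R%d: quorum=%d, tolerated failures=%d\n", r, quorum(r), tolerated(r))
	}
}
```

Under this reading an R2 group needs both of its members, so it tolerates no failures, while R3 tolerates one and R5 tolerates two.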

Closing for now but feel free to re-open as needed.

derekcollison avatar Jan 06 '23 16:01 derekcollison

> Unfortunately setting the num_replicas to 1 in the backup.json file did not solve the problem.

@tommylp, you should just change that property on the existing stream (without touching the backup-restore settings):

nats stream edit <STREAM_NAME> --replicas 3

> When this issue happens, our application cannot connect to nats. The only fix we know is to delete the stream and re-create it, which causes some data loss.

@nvcnvn, instead of deleting the entire stream, just try updating its replicas value:

nats stream edit <STREAM_NAME> --replicas 1
nats stream edit <STREAM_NAME> --replicas 3

It would re-create replicas on the available servers.

osmanovv avatar Mar 13 '23 14:03 osmanovv