nats-server
Re-deploying nats-jetstream results in `NO quorum, stalled.`
Defect
Make sure that these boxes are checked before submitting your issue -- thank you!
- [ ] Included `nats-server -DV` output
- [ ] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)
Versions of `nats-server` and affected client libraries used:
2.6.6-alpine3.14
OS/Container environment:
Steps or code to reproduce the issue:
Re-deploy the NATS cluster.
Expected result:
The deployment succeeds.
Actual result:
2021-12-08 16:45:15.695 ICT[1] 2021/12/08 09:45:15.694914 [WRN] JetStream cluster consumer 'A > syncuserregistration > durable-sync-staff' has NO quorum, stalled.
2021-12-08 16:45:16.904 ICT[1] 2021/12/08 09:45:16.904527 [WRN] JetStream cluster stream 'A > studenteventlogs' has NO quorum, stalled.
2021-12-08 16:45:18.250 ICT[1] 2021/12/08 09:45:18.249982 [WRN] JetStream cluster consumer 'A > eurekastudentevent > durable-eureka-student-event-created' has NO quorum, stalled.
2021-12-08 16:45:18.414 ICT[1] 2021/12/08 09:45:18.414104 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-class' has NO quorum, stalled.
2021-12-08 16:45:18.565 ICT[1] 2021/12/08 09:45:18.564943 [WRN] JetStream cluster stream 'A > syncmasterregistration' has NO quorum, stalled.
2021-12-08 16:45:19.506 ICT[1] 2021/12/08 09:45:19.506590 [WRN] JetStream cluster stream 'A > activitylog' has NO quorum, stalled.
2021-12-08 16:45:19.717 ICT[1] 2021/12/08 09:45:19.717403 [WRN] JetStream cluster consumer 'A > learningobjectives > durable-learning-objectives-created' has NO quorum, stalled.
2021-12-08 16:45:20.278 ICT[1] 2021/12/08 09:45:20.278274 [WRN] JetStream cluster consumer 'A > studenteventlogs > durable-student-event-logs-created' has NO quorum, stalled.
2021-12-08 16:45:20.506 ICT[1] 2021/12/08 09:45:20.505963 [WRN] JetStream cluster consumer 'A > syncuserregistration > durable-log-payload' has NO quorum, stalled.
2021-12-08 16:45:20.687 ICT[1] 2021/12/08 09:45:20.686837 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-academic-year' has NO quorum, stalled.
2021-12-08 16:45:20.805 ICT[1] 2021/12/08 09:45:20.805577 [WRN] JetStream cluster consumer 'A > cloudconvertjobevent > durable-cloud-convert' has NO quorum, stalled.
2021-12-08 16:45:22.363 ICT[1] 2021/12/08 09:45:22.363649 [WRN] JetStream cluster consumer 'A > activitylog > durable-activity-log-created' has NO quorum, stalled.
2021-12-08 16:45:22.601 ICT[1] 2021/12/08 09:45:22.601333 [WRN] JetStream cluster stream 'A > learningobjectives' has NO quorum, stalled.
2021-12-08 16:45:23.257 ICT[1] 2021/12/08 09:45:23.257704 [WRN] JetStream cluster consumer 'A > assignstudyplan > durable-assign-study-plan' has NO quorum, stalled.
2021-12-08 16:45:24.043 ICT[1] 2021/12/08 09:45:24.043701 [WRN] JetStream cluster stream 'A > syncusercourse' has NO quorum, stalled.
2021-12-08 16:45:24.807 ICT[1] 2021/12/08 09:45:24.807032 [WRN] JetStream cluster stream 'A > chatmessage' has NO quorum, stalled.
2021-12-08 16:45:24.989 ICT[1] 2021/12/08 09:45:24.989169 [WRN] JetStream cluster stream 'A > assignstudyplan' has NO quorum, stalled.
2021-12-08 16:45:26.588 ICT[1] 2021/12/08 09:45:26.588135 [WRN] JetStream cluster consumer 'A > studentpackage > durable-student-package' has NO quorum, stalled.
2021-12-08 16:45:28.324 ICT[1] 2021/12/08 09:45:28.323815 [WRN] JetStream cluster consumer 'A > syncmasterregistration > durable-sync-course-class' has NO quorum, stalled.
Can you give details about how many servers are in your cluster? What is the replication factor of the streams and consumers?
I've run into this issue with a 5-node cluster. I can't reproduce it consistently, but it seemed to be correlated with an inconsistency in the cluster size reported by each server after restarting a node (see #2657).
We were unable to resolve this issue and get the cluster size to report the correct size consistently. Instead, we moved to a 3-node cluster and haven't had issues since.
We seem to be having the same issue with the same version.
What can we provide to help debug it?
Same problem? Cluster size: 3. NATS: 2.7.4. Using NATS security with distributed JWTs.
Updated from 2.6.6 to a new version (2.7.4). Before upgrading, a NATS backup of the streams was performed, but the streams can no longer be restored. The error is:
[WRN] JetStream cluster stream 'AD2XXTUQI453QTLRZYHP4O2NGKPUMI6T22MGKKUWADO3IS6W226NQZX7 > <stream>' has NO quorum, stalled
Checking whether the stream exists:
nats -s <server> --creds <credsFile> stream report
Obtaining Stream stats
No Streams defined
If I try to create the same stream again after the failed restore I'm getting this:
nats -s <server> --creds <credsFile> stream create <streamName>
? Subjects to consume <topic>.>
? Storage backend file
? Retention Policy Limits
? Discard Policy Old
? Stream Messages Limit -1
? Message size limit -1
? Maximum message age limit 3M
? Maximum individual message size -1
? Duplicate tracking time window 5m
? Replicas 2
nats: error: could not create Stream: malformed or corrupt message
Please run your create command with `--trace` and show the output.
12:20:23 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":2,"duplicate_window":300000000000}
12:20:23 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","error":{"code":500,"err_code":10049,"description":"malformed or corrupt message"}}
nats: error: could not create Stream: malformed or corrupt message
Setting the replica count to 1 allows the stream to be created.
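For anyone scripting against the JetStream API, the error body in the trace above can be picked apart with a few lines of Python. This is only a sketch: the helper name `describe_api_error` is made up for illustration, and the JSON is copied from the trace rather than fetched from a server.

```python
import json

# Response body copied verbatim from the --trace output above.
raw = (
    '{"type":"io.nats.jetstream.api.v1.stream_create_response",'
    '"error":{"code":500,"err_code":10049,'
    '"description":"malformed or corrupt message"}}'
)


def describe_api_error(body):
    """Return (code, err_code, description) if the response has an error, else None.

    Hypothetical helper for illustration; not part of any NATS client library.
    """
    error = json.loads(body).get("error")
    if error is None:
        return None
    return error.get("code"), error.get("err_code"), error.get("description")


print(describe_api_error(raw))
```

Logging `err_code` alongside the description makes it easier to tell this failure apart from other 500 responses.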
Your subject appears to end in `foo.>>`; it can only have one `>`.
Assuming you have a system account, use it and show the output of `nats server list` and `nats server report jsz`.
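A quick way to catch the `foo.>>` kind of typo before sending a config is a small lint over the subject tokens. The sketch below is a simplified check, not the server's own validation logic: it relies only on the rule that `>` acts as a full wildcard when it stands alone as the last token. The function name `lint_subject` is hypothetical.

```python
def lint_subject(subject):
    """Return a list of human-readable problems found in a subject filter.

    Simplified illustration only; the server's real validation may differ.
    """
    problems = []
    tokens = subject.split(".")
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        if tok == "":
            problems.append(f"empty token at position {i}")
        elif tok == ">" and i != last:
            problems.append("'>' is only a wildcard as the last token")
        elif ">" in tok and tok != ">":
            problems.append(f"token {tok!r} contains '>' but is not a lone '>'")
    return problems


for candidate in ("foo.>", "foo.>>"):
    print(candidate, lint_subject(candidate))
```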
Rerun with a single `>`:
nats -s <server> --creds <credsFile> stream create <stream> --config stream.config --trace
12:25:47 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":2,"duplicate_window":300000000000}
12:25:47 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","error":{"code":500,"err_code":10049,"description":"malformed or corrupt message"}}
nats: error: could not create Stream: malformed or corrupt message
Content of config
{
"name": "<stream>",
"subjects": [
"<topic>.\u003e"
],
"retention": "limits",
"max_consumers": -1,
"max_msgs": -1,
"max_bytes": -1,
"max_age": 7776000000000000,
"max_msg_size": -1,
"storage": "file",
"discard": "old",
"num_replicas": 2,
"duplicate_window": 300000000000
}
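The large integers in this config are durations in nanoseconds, which is consistent with the CLI rendering the same values as `90d0h0m0s` and `5m0s` elsewhere in this thread. A short sketch of the arithmetic (the helper name `ns_to_timedelta` is only for illustration):

```python
from datetime import timedelta

NANOS_PER_SECOND = 1_000_000_000


def ns_to_timedelta(ns):
    """Convert a nanosecond count, as used in the stream config, to a timedelta."""
    return timedelta(seconds=ns // NANOS_PER_SECOND)


max_age = ns_to_timedelta(7776000000000000)       # "max_age" from the config
duplicate_window = ns_to_timedelta(300000000000)  # "duplicate_window" from the config

print(max_age.days)                      # 90
print(duplicate_window.total_seconds())  # 300.0
```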
server list
+-----------------------------------------------------------------------------------------------------------------------------------+
| Server Overview |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
| Name | Cluster | IP | Version | JS | Conns | Subs | Routes | GWs | Mem | CPU | Slow | Uptime | RTT |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
| nats-core-1 | nats-core | 0.0.0.0 | 2.7.4 | yes | 0 | 302 | 2 | 0 | 17 MiB | 0.0 | 0 | 27m22s | 71.283698ms |
| nats-core-0 | nats-core | 0.0.0.0 | 2.7.4 | yes | 14 | 366 | 2 | 0 | 26 MiB | 0.0 | 0 | 1h59m34s | 71.246344ms |
| nats-core-2 | nats-core | 0.0.0.0 | 2.7.4 | yes | 5 | 328 | 2 | 0 | 17 MiB | 0.0 | 0 | 29m51s | 71.193031ms |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
| | 1 Clusters | 3 Servers | | 3 | 19 | 996 | | | 60 MiB | | 0 | | |
+-------------+------------+-----------+---------+-----+-------+------+--------+-----+--------+-----+------+----------+-------------+
+------------------------------------------------------------------------------+
| Cluster Overview |
+-----------+------------+-------------------+-------------------+-------------+
| Cluster | Node Count | Outgoing Gateways | Incoming Gateways | Connections |
+-----------+------------+-------------------+-------------------+-------------+
| nats-core | 3 | 0 | 0 | 19 |
+-----------+------------+-------------------+-------------------+-------------+
| | 3 | 0 | 0 | 19 |
+-----------+------------+-------------------+-------------------+-------------+
server report jsz:
+-------------------------------------------------------------------------------------------------------+
| JetStream Summary |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
| Server | Cluster | Streams | Consumers | Messages | Bytes | Memory | File | API Req | API Err |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
| nats-core-2* | nats-core | 0 | 0 | 0 | 0 B | 0 B | 0 B | 8 | 7 |
| nats-core-0 | nats-core | 0 | 0 | 0 | 0 B | 0 B | 0 B | 2 | 0 |
| nats-core-1 | nats-core | 0 | 0 | 0 | 0 B | 0 B | 0 B | 6 | 0 |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
| | | 0 | 0 | 0 | 0 B | 0 B | 0 B | 16 | 7 |
+--------------+-----------+---------+-----------+----------+-------+--------+------+---------+---------+
+--------------------------------------------------------+
| RAFT Meta Group Information |
+-------------+--------+---------+--------+--------+-----+
| Name | Leader | Current | Online | Active | Lag |
+-------------+--------+---------+--------+--------+-----+
| nats-core-0 | | true | true | 0.22s | 0 |
| nats-core-1 | | true | true | 0.22s | 0 |
| nats-core-2 | yes | true | true | 0.00s | 0 |
+-------------+--------+---------+--------+--------+-----+
So if you just change your config to replicas 1, it works? (The config is valid now.)
Yes, no problem.
nats -s <server> --creds <credsFile> stream create <stream> --config stream.config --trace
12:32:41 >>> $JS.API.STREAM.CREATE.<stream>
{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msg_size":-1,"storage":"file","discard":"old","num_replicas":1,"duplicate_window":300000000000}
12:32:41 <<< $JS.API.STREAM.CREATE.<stream>
{"type":"io.nats.jetstream.api.v1.stream_create_response","config":{"name":"<stream>","subjects":["<topic>.\u003e"],"retention":"limits","max_consumers":-1,"max_msgs":-1,"max_bytes":-1,"max_age":7776000000000000,"max_msgs_per_subject":-1,"max_msg_size":-1,"discard":"old","storage":"file","num_replicas":1,"duplicate_window":300000000000,"sealed":false,"deny_delete":false,"deny_purge":false,"allow_rollup_hdrs":false},"created":"2022-03-28T10:32:41.197592339Z","state":{"messages":0,"bytes":0,"first_seq":0,"first_ts":"0001-01-01T00:00:00Z","last_seq":0,"last_ts":"0001-01-01T00:00:00Z","consumer_count":0},"cluster":{"name":"nats-core","leader":"nats-core-1"},"did_create":true}
Stream <stream> was created
Information for Stream <stream> created 2022-03-28T12:32:41+02:00
Configuration:
Subjects: <topic>.>
Acknowledgements: true
Retention: File - Limits
Replicas: 1
Discard Policy: Old
Duplicate Window: 5m0s
Maximum Messages: unlimited
Maximum Bytes: unlimited
Maximum Age: 90d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats-core
Leader: nats-core-1
State:
Messages: 0
Bytes: 0 B
FirstSeq: 0
LastSeq: 0
Active Consumers: 0
Unfortunately, setting `num_replicas` to 1 in the backup.json file did not solve the problem.
nats: error: restore failed: malformed or corrupt message
Do I need to update the value inside the base64 configuration as well? That will probably not work, because the checksum value would change.
How is the checksum created?
{
"type": "stream",
"time": "2022-03-28T08:27:07Z",
"configuration": "<base64String>",
"checksum": "31423daa92ee............"
}
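To at least see what the embedded configuration contains, the base64 string can be decoded for inspection. The sketch below assumes the `configuration` field decodes to a JSON document, which this thread suggests but does not confirm. It uses a synthetic value rather than a real backup, and it does not attempt to recompute the checksum, since the thread does not say how that is produced.

```python
import base64
import json


def decode_configuration(backup_meta):
    """Decode the base64 'configuration' field of a backup.json-style dict.

    Assumes the decoded bytes are JSON; that is an assumption, not a documented fact.
    """
    return json.loads(base64.b64decode(backup_meta["configuration"]))


# Synthetic stand-in for backup.json, built here so the example is self-contained.
sample_config = {"name": "EXAMPLE", "num_replicas": 2}
sample_meta = {
    "type": "stream",
    "configuration": base64.b64encode(json.dumps(sample_config).encode()).decode(),
}

print(decode_configuration(sample_meta)["num_replicas"])
```

This is read-only inspection; re-encoding a modified configuration would leave the stored checksum stale.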
I think you can pass `--replicas` when restoring rather than editing the file.
There is no `--replicas` flag that I can see. The `nats stream restore` command has a `--config` flag that takes a config file, but that did not work either.
The problem seems to be related to the JetStream leader: evicting the leader, or killing the leader pod in k8s so that it moves to another instance, lets me create the stream with 2 replicas.
Doing the same for stream restore still does not work.
Related issue: https://github.com/nats-io/nats-server/issues/2845
Also experiencing this problem on NATS v2.7.4 with a 3-node cluster, 3 replicas per stream, 10k subjects, and 10k push consumers (one per subject) spread among 10 to 20 streams (the exact number doesn't matter).
After a cluster restart, a lot of consumers (but not all) have no quorum and become stalled.
We're facing the same error when we create a new node pool and evict NATS streaming.
What is the procedure to work around this? When this issue happens, our application cannot connect to NATS; we only know to delete the stream and re-create it, which causes some data loss.
@sergiilizo and @nvcnvn we would most likely need to jump on a Zoom call to diagnose more thoroughly the situation.
Hi @derekcollison thanks for your great support, how should we arrange this?
Shoot me an email, [email protected].
@derekcollison I received the same error for a stalled consumer.
JetStream cluster consumer '$G > configuration > admin_CreateAdminUserCommand_firebase_CreateAdminUser' has NO quorum, stalled.
Healthcheck failed: "JetStream consumer '$G > configuration > admin_AdminUserCreatedEvent_firebase_AdminUserCreated' is not current"
This is a 5-node cluster and the stream was set to 3 replicas.
Version 2.8.2 - k8s - attached pvc
I changed the replicas to 5 and the cluster became stable again.
Can you tell me how quorum is calculated for a consumer with a replica of 3 in a 5 node cluster? The only doc for quorum I could find was https://docs.nats.io/running-a-nats-service/configuration/clustering/jetstream_clustering#the-quorum.
If the same calculation applies (half the nodes plus 1), then I assume quorum won't be reached if a node that held the consumer's data (replicas 3) drops out of the 5-node cluster. Is this valid, or am I off base?
2.8.3 should be released tomorrow which hopefully helps out here.
Quorum calculation is N/2+1 (integer division). So for R3 it's 2, for R5 it's 3.
Closing for now but feel free to re-open as needed.
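The quorum formula quoted above, N/2+1 with N taken as the replica count (R), can be written out as a small sketch. The function names are illustrative only; the numbers for R3 and R5 match the ones given in the reply.

```python
def quorum(replicas):
    """Number of replicas that must agree: N/2+1 using integer division."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can be unavailable while a quorum still exists."""
    return replicas - quorum(replicas)


for r in (1, 2, 3, 5):
    print(f"R{r}: quorum={quorum(r)}, tolerated failures={tolerated_failures(r)}")
```

By this arithmetic an R3 stream or consumer keeps quorum with one of its three replicas down, while an R2 asset needs both of its replicas available.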
> Unfortunately setting the `num_replicas` to 1 in the backup.json file did not solve the problem.
@tommylp, you should just change that property on the existing stream (without the backup-restore settings):
nats stream edit <STREAM_NAME> --replicas 3
> When this issue happens, our application cannot connect to NATS; we only know to delete the stream and re-create it, which causes some data loss.
@nvcnvn, instead of deleting the entire stream, just try updating the replicas value in it:
nats stream edit <STREAM_NAME> --replicas 1
nats stream edit <STREAM_NAME> --replicas 3
It would re-create replicas on the available servers.