nats-server
Message count in nats stream ls VS localhost:8222/jsz monitoring endpoint
NATS 2.8.4, 3-node JetStream cluster.

Node 1:
nats stream ls reports all streams have 0 messages.
curl localhost:8222/jsz?consumers=1 reports "messages": 631 for stream s1. <--- ?!
After a nats restart, nats.log has the entry Restored 631 messages for stream 'USERS > s1'.
ls {store_dir}/USERS/streams/s1/msgs also shows a 4609.blk file dated a few days ago.

Node 2:
nats stream ls reports all streams have 0 messages.
curl localhost:8222/jsz?consumers=1 reports "messages": 0 for stream s1.

Node 3:
nats stream ls reports all streams have 0 messages.
curl localhost:8222/jsz?consumers=1 reports "messages": 0 for stream s1.
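For reference, a quick way to reproduce this per-node comparison is something like the sketch below (the ns1/ns2/ns3 hostnames are placeholders for your own servers; jq is assumed to be installed):

# Per-node jsz message counts per stream, vs. the CLI view for comparison
for node in ns1.internal ns2.internal ns3.internal; do
  echo "== $node =="
  curl -s "http://$node:8222/jsz?streams=true" \
    | jq -r '.account_details[].stream_detail[] | "\(.name) \(.state.messages)"'
done
nats stream report   # CLI view for comparison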
Even though stream s1 has Retention: File - Interest, in principle there should not be any (or at least not many) messages, since they are consumed and removed as soon as they arrive.
nats stream info s1:
Information for Stream s1 created 2022-07-13T20:07:45Z
Configuration:
Subjects: s1.>
Acknowledgements: true
Retention: File - Interest
Replicas: 3
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Maximum Messages: unlimited
Maximum Bytes: 1.0 GiB
Maximum Age: 5d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats-cluster
Leader: n1
Replica: n2, current, seen 0.01s ago
Replica: n3, current, seen 0.01s ago
State:
Messages: 0
Bytes: 0 B
FirstSeq: 42,409,845
LastSeq: 42,409,844 @ 2022-08-15T15:18:00 UTC
Active Consumers: 2
nats stream view doesn't show anything older than a few seconds.
What is the recommended way of getting back to a consistent state at this point?
For what it's worth, after running nats stream view s1 (which reported nothing but "Reached apparent end of data"), the message count in curl localhost:8222/jsz?consumers=1 cleared out, and so did the data under the store_dir ... :|
FYI this keeps happening every now and then. It seems to correlate with "JetStream cluster no metadata leader", "JetStream cluster stream 'USERS > s1' has NO quorum, stalled" and similar log events.
What is happening with the cluster such that the meta-leader drops? That should not happen under normal operation.
I realise this isn't terribly useful feedback, sorry, but we haven't got much in the other logs. The services themselves (both the ones using this stream and the ones not using NATS at all) seem to be running fine, and the databases show no errors. If it helps, we are on AWS EC2, running the nats docker image but with --network host --pid host. The JetStream store_dir is on EBS (like our databases).
Cluster routes in the conf use internal A records, e.g. routes: ["nats://ns1.internal:6222", "nats://ns2.internal:6222", "nats://ns3.internal:6222"]. Each record maps to a single distinct IP.
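Put together, the relevant server configuration looks roughly like this (a sketch reconstructed from the description above and the jsz config shown later in this thread; exact values may differ):

# nats-server.conf (sketch)
server_name: n1
port: 4222
http_port: 8222

cluster {
  name: nats-cluster
  port: 6222
  routes: [
    "nats://ns1.internal:6222"
    "nats://ns2.internal:6222"
    "nats://ns3.internal:6222"
  ]
}

jetstream {
  store_dir: /natsdata/jetstream
  max_memory_store: 1GB
  max_file_store: 10GB
}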
We've got pretty much the same issue from time to time after leader changes (network problems?). stream info and stream ls show 0 messages on every node, but the metrics on one non-leader node show more messages (which never get consumed), and even after restarting all 3 nodes the counter is not reset.
curl 'http://localhost:8222/jsz?streams=true' | jq -r '.account_details[].stream_detail[]| "\(.name) \(.state.messages)"'
Prometheus metrics look like this:
jetstream_stream_total_messages{account="$G", cluster="nats", instance="nats-0", is_meta_leader="false", is_stream_leader="true", job="nats", meta_leader="nats-2", server_id="http://10.0.0.1:8222", server_name="nats-0", stream_leader="nats-0", stream_name="tasks"} 0
jetstream_stream_total_messages{account="$G", cluster="nats", instance="nats-1", is_meta_leader="false", is_stream_leader="false", job="nats", meta_leader="nats-2", server_id="http://10.0.0.2:8222", server_name="nats-1", stream_leader="nats-0", stream_name="tasks"} 17
jetstream_stream_total_messages{account="$G", cluster="nats", instance="nats-2", is_meta_leader="true", is_stream_leader="false", job="nats", meta_leader="nats-2", server_id="http://10.0.0.3:8222", server_name="nats-2", stream_leader="nats-0", stream_name="tasks"} 0
Stream info:
Information for Stream tasks created 2022-04-22T12:59:25Z
Configuration:
Subjects: tasks
Acknowledgements: true
Retention: File - WorkQueue
Replicas: 3
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Maximum Messages: unlimited
Maximum Bytes: unlimited
Maximum Age: unlimited
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats
Leader: nats-0
Replica: nats-1, current, seen 0.69s ago
Replica: nats-2, current, seen 0.69s ago
State:
Messages: 0
Bytes: 0 B
FirstSeq: 13,531,033
LastSeq: 13,531,032 @ 2022-08-30T09:05:14 UTC
Active Consumers: 1
After several tries of nats stream cluster step-down tasks, when the node with the non-zero message count becomes the leader, nats stream info tasks shows non-zero messages (a sketch of this step-down loop is at the end of this comment):
Information for Stream tasks created 2022-04-22T12:59:25Z
Configuration:
Subjects: tasks
Acknowledgements: true
Retention: File - WorkQueue
Replicas: 3
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Maximum Messages: unlimited
Maximum Bytes: unlimited
Maximum Age: unlimited
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats
Leader: nats-1
Replica: nats-0, current, seen 0.27s ago
Replica: nats-2, current, seen 0.27s ago
State:
Messages: 17
Bytes: 27 KiB
FirstSeq: 13,360,589 @ 2022-08-28T17:38:06 UTC
LastSeq: 13,533,300 @ 2022-08-30T09:36:44 UTC
Deleted Messages: 172695
Active Consumers: 1
And these messages never get consumed, even after restarting the consumers, while all new messages are fine (e.g. the count grows to 18, the new message gets consumed, and the count is back to 17).
The only fix we found is recreating the stream. Is there a simpler solution? Is this just a counter problem, or are these messages really not being consumed and effectively lost?
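For completeness, the step-down procedure mentioned above, spelled out (a sketch; the stream name tasks and target node nats-1 match this report, and nats stream info --json plus jq are assumed to be available):

nats stream cluster step-down tasks                      # may ask for confirmation
nats stream info tasks --json | jq -r .cluster.leader    # check who is leader now
# repeat the two commands above until the node reporting non-zero messages
# (nats-1 in our case) is the leader, then compare the counts again:
nats stream info tasks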
@urusha sounds very similar to what we are experiencing indeed. I don't have a solution, but a workaround we use is nats stream view tasks. If there are a bunch of messages and the nats viewer is paging results, just select No to cancel, and that clears out the messages.
PS: sometimes this command seems to leave a temporary consumer hanging on that stream, accumulating even more messages :) Then just run nats consumer rm tasks (and choose the offending consumer) to delete the temporary consumer and all is back to normal. Works for us, YMMV.
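In short, the workaround is (a sketch; tasks is the stream from the report above):

nats stream view tasks       # page through, answer No to cancel; this seems to clear the phantom count
nats consumer report tasks   # check for a leftover temporary consumer
nats consumer rm tasks       # pick the offending consumer to delete it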
Issue update. We are now on nats-server 2.9.1 and nats CLI 0.0.34.
The CLI still reports 0 messages, while curl localhost:8222/jsz reports a non-zero message count:
nats stream report && nats stream info logs && nats consumer report logs && nats consumer info logs c1 && curl localhost:8222/jsz?consumers=1
Obtaining Stream stats
╭──────────────────────────────────────────────────────────────────────────────────────────────╮
│ Stream Report │
├──────────┬─────────┬───────────┬───────────┬──────────┬───────┬──────┬─────────┬─────────────┤
│ Stream │ Storage │ Placement │ Consumers │ Messages │ Bytes │ Lost │ Deleted │ Replicas │
├──────────┼─────────┼───────────┼───────────┼──────────┼───────┼──────┼─────────┼─────────────┤
│ ... │ File │ │ 1 │ 0 │ 0 B │ 0 │ 0 │ n1*, n2, n3 │
│ logs │ File │ │ 2 │ 0 │ 0 B │ 0 │ 0 │ n1*, n2, n3 │
╰──────────┴─────────┴───────────┴───────────┴──────────┴───────┴──────┴─────────┴─────────────╯
Information for Stream logs created 2022-07-13 20:07:45
Subjects: logs.>
Replicas: 3
Storage: File
Options:
Retention: Interest
Acknowledgements: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: unlimited
Maximum Bytes: 1.0 GiB
Maximum Age: 5d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats-cluster
Leader: n1
Replica: n2, current, seen 0.10s ago
Replica: n3, current, seen 0.10s ago
State:
Messages: 0
Bytes: 0 B
FirstSeq: 90,650,977
LastSeq: 90,650,976 @ 2022-09-27T07:53:29 UTC
Active Consumers: 2
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Consumer report for logs with 2 consumers │
├───────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬─────────────┬────────────┬─────────────┤
│ Consumer │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed │ Ack Floor │ Cluster │
├───────────┼──────┼────────────┼──────────┼─────────────┼─────────────┼─────────────┼────────────┼─────────────┤
│ c2 │ Push │ Explicit │ 30.00s │ 0 │ 0 │ 0 │ 90,650,976 │ n1*, n2, n3 │
│ c1 │ Push │ Explicit │ 30.00s │ 0 │ 0 │ 0 │ 90,650,976 │ n1*, n2, n3 │
╰───────────┴──────┴────────────┴──────────┴─────────────┴─────────────┴─────────────┴────────────┴─────────────╯
Information for Consumer logs > c1 created 2022-09-23T13:21:10Z
Configuration:
Durable Name: c1
Delivery Subject: c1.logs
Deliver Policy: All
Deliver Queue Group: c1
Ack Policy: Explicit
Ack Wait: 30s
Replay Policy: Instant
Max Ack Pending: 1,000
Flow Control: false
Cluster Information:
Name: nats-cluster
Leader: n1
Replica: n2, current, seen 0.04s ago
Replica: n3, current, seen 0.04s ago
State:
Last Delivered Message: Consumer sequence: 90,655,130 Stream sequence: 90,650,977 Last delivery: 0.05s ago
Acknowledgment floor: Consumer sequence: 90,655,130 Stream sequence: 90,650,977 Last Ack: 0.04s ago
Outstanding Acks: 0 out of maximum 1,000
Redelivered Messages: 0
Unprocessed Messages: 0
Active Interest: Active using Queue Group c1
{
"server_id": "NCCJQ7RBYCMAZPOD5PWEQGJKREAK6SCZZYE36CAZGPTWGXCSR3LKCAM6",
"now": "2022-09-27T07:53:30.817377233Z",
"config": {
"max_memory": 1073741824,
"max_storage": 10737418240,
"store_dir": "/natsdata/jetstream",
"compress_ok": true
},
"memory": 0,
"storage": 3319,
"reserved_memory": 0,
"reserved_storage": 5368709120,
"accounts": 1,
"ha_assets": 12,
"api": {
"total": 6,
"errors": 0
},
"streams": 5,
"consumers": 6,
"messages": 4,
"bytes": 3319,
"meta_cluster": {
"name": "nats-cluster",
"leader": "n3",
"peer": "PCIUZRPn",
"cluster_size": 3
},
"account_details": [
{
"name": "USERS",
"id": "USERS",
"memory": 0,
"storage": 5051,
"reserved_memory": 18446744073709551615,
"reserved_storage": 18446744073709551615,
"accounts": 0,
"ha_assets": 0,
"api": {
"total": 86,
"errors": 2
},
"stream_detail": [
...,
{
"name": "logs",
"cluster": {
"name": "nats-cluster",
"leader": "n1",
"replicas": [
{
"name": "n1",
"current": true,
"active": 65687391,
"lag": 56270029,
"peer": "0XwpHHIx"
},
{
"name": "n3",
"current": false,
"active": 57142604217790,
"lag": 56270029,
"peer": "PCIUZRPn"
}
]
},
"state": {
"messages": 4,
"bytes": 3319,
"first_seq": 89934011,
"first_ts": "2022-09-26T16:00:57.667566887Z",
"last_seq": 90650977,
"last_ts": "2022-09-27T07:53:30.752516108Z",
"num_subjects": 3,
"num_deleted": 716963,
"consumer_count": 2
},
"consumer_detail": [
{
"stream_name": "logs",
"name": "c2",
"created": "2022-09-23T13:30:51.348276829Z",
"delivered": {
"consumer_seq": 86706344,
"stream_seq": 89907779,
"last_active": "2022-09-27T07:53:30.752760301Z"
},
"ack_floor": {
"consumer_seq": 86706344,
"stream_seq": 86702758,
"last_active": "2022-09-27T07:53:30.753759861Z"
},
"num_ack_pending": 0,
"num_redelivered": 0,
"num_waiting": 0,
"num_pending": 710433,
"cluster": {
"name": "nats-cluster",
"leader": "n1",
"replicas": [
{
"name": "n1",
"current": true,
"active": 63731012,
"lag": 110882275,
"peer": "0XwpHHIx"
},
{
"name": "n3",
"current": false,
"active": 57142594877123,
"lag": 110882275,
"peer": "PCIUZRPn"
}
]
},
"push_bound": true
},
{
"stream_name": "logs",
"name": "c1",
"created": "2022-09-23T13:30:51.34837326Z",
"delivered": {
"consumer_seq": 86707542,
"stream_seq": 89934046,
"last_active": "2022-09-27T07:53:30.752997658Z"
},
"ack_floor": {
"consumer_seq": 86707542,
"stream_seq": 86702767,
"last_active": "2022-09-27T07:53:30.753864074Z"
},
"num_ack_pending": 0,
"num_redelivered": 0,
"num_waiting": 0,
"num_pending": 710431,
"cluster": {
"name": "nats-cluster",
"leader": "n1",
"replicas": [
{
"name": "n1",
"current": true,
"active": 66541942,
"lag": 108397454,
"peer": "0XwpHHIx"
},
{
"name": "n3",
"current": false,
"active": 57150206277950,
"lag": 108397454,
"peer": "PCIUZRPn"
}
]
},
"push_bound": true
}
]
}
]
}
]
}
@Kaarel I believe that the jsz output makes sense only when inspecting the leader (for a given asset). @ripienaar can correct me if I am wrong.
Correct, only the leader can be trusted
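In practice that means: look up the current leader for the asset and query the monitoring endpoint on that node only, along the lines of this sketch (it assumes the server name resolves as a hostname and that jq is installed):

LEADER=$(nats stream info logs --json | jq -r .cluster.leader)
curl -s "http://$LEADER:8222/jsz?streams=true" \
  | jq -r '.account_details[].stream_detail[] | "\(.name) \(.state.messages)"'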