
Message count in nats stream ls VS localhost:8222/jsz monitoring endpoint

Kaarel opened this issue 2 years ago • 7 comments

NATS 2.8.4, 3-node JetStream cluster

node 1: nats stream ls reports all streams have 0 messages, but curl localhost:8222/jsz?consumers=1 reports "messages": 631 for stream s1 <--- ?! After a nats restart, nats.log has the entry Restored 631 messages for stream 'USERS > s1', and ls {store_dir}/USERS/streams/s1/msgs also shows a 4609.blk file dated a few days ago.

node 2: nats stream ls reports all streams have 0 messages, and curl localhost:8222/jsz?consumers=1 reports "messages": 0 for stream s1.

node 3: nats stream ls reports all streams have 0 messages, and curl localhost:8222/jsz?consumers=1 reports "messages": 0 for stream s1.

Even though stream s1 has Retention: File - Interest, in principle there should not be many messages at all, as they are consumed and removed as soon as they appear.
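The per-node discrepancy described above can be checked programmatically by fetching each node's monitoring endpoint and diffing the per-stream counts. A minimal sketch, not from the thread itself; the JSON shape follows the /jsz?streams=true output quoted later in this issue, and the node addresses in the commented usage are hypothetical:

```python
def stream_counts(jsz: dict) -> dict:
    """Map stream name -> message count from one node's /jsz?streams=true response."""
    counts = {}
    for account in jsz.get("account_details", []):
        for stream in account.get("stream_detail", []):
            counts[stream["name"]] = stream["state"]["messages"]
    return counts

def find_divergent(per_node: dict) -> dict:
    """Return streams whose message count differs between nodes.

    per_node maps node name -> parsed jsz dict for that node.
    """
    all_counts = {node: stream_counts(jsz) for node, jsz in per_node.items()}
    streams = set().union(*(c.keys() for c in all_counts.values()))
    divergent = {}
    for name in streams:
        seen = {node: counts.get(name) for node, counts in all_counts.items()}
        if len(set(seen.values())) > 1:  # nodes disagree on this stream
            divergent[name] = seen
    return divergent

# Fetching live data could look like (node names are assumptions):
# import json; from urllib.request import urlopen
# per_node = {n: json.load(urlopen(f"http://{n}:8222/jsz?streams=true"))
#             for n in ("ns1.internal", "ns2.internal", "ns3.internal")}
```

Any stream returned by find_divergent is in the inconsistent state described above, e.g. 631 messages on one replica and 0 on the others.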

nats stream info s1:

Information for Stream s1 created 2022-07-13T20:07:45Z

Configuration:

             Subjects: s1.>
     Acknowledgements: true
            Retention: File - Interest
             Replicas: 3
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: unlimited
        Maximum Bytes: 1.0 GiB
          Maximum Age: 5d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


Cluster Information:

                 Name: nats-cluster
               Leader: n1
              Replica: n2, current, seen 0.01s ago
              Replica: n3, current, seen 0.01s ago

State:

             Messages: 0
                Bytes: 0 B
             FirstSeq: 42,409,845
              LastSeq: 42,409,844 @ 2022-08-15T15:18:00 UTC
     Active Consumers: 2

nats stream view doesn't show anything older than a few seconds

What is the recommended way of getting back to a consistent state at this point?

Kaarel avatar Aug 15 '22 15:08 Kaarel

For what it's worth, after nats stream view s1 (which reported nothing but "Reached apparent end of data"), the message count in curl localhost:8222/jsz?consumers=1 cleared out, as did the data in the store_dir ... :|

Kaarel avatar Aug 17 '22 12:08 Kaarel

FYI, this keeps happening every now and then. It seems to correlate with "JetStream cluster no metadata leader", "JetStream cluster stream 'USERS > s1' has NO quorum, stalled" etc. log events.

Kaarel avatar Aug 23 '22 11:08 Kaarel

What is happening with the cluster such that the meta-leader drops? That should not happen under normal operation.

derekcollison avatar Aug 23 '22 13:08 derekcollison

I realise this isn't terribly useful feedback, sorry, but we haven't got much in the other logs. The services themselves (both the ones using this stream and the ones not using NATS at all) seem to be running fine, and the databases show no errors. If it helps, we are on AWS EC2, running the nats Docker image but with --network host --pid host. The JS store_dir is on EBS (like our databases).

Kaarel avatar Aug 23 '22 14:08 Kaarel

Cluster routes in the conf use internal A records, e.g. routes: ["nats://ns1.internal:6222", "nats://ns2.internal:6222", "nats://ns3.internal:6222"]. Each record maps to a single distinct IP.

Kaarel avatar Aug 23 '22 14:08 Kaarel

We get pretty much the same issue from time to time after leader changes (network problems?). stream info and stream ls show 0 messages on every node, but metrics on one non-leader node show more messages (which never get consumed); even after restarting all 3 nodes, the counter is not reset.

curl 'http://localhost:8222/jsz?streams=true' | jq -r '.account_details[].stream_detail[]| "\(.name) \(.state.messages)"'

Prometheus metrics look like this:

jetstream_stream_total_messages{account="$G", cluster="nats", instance="nats-0", is_meta_leader="false", is_stream_leader="true", job="nats", meta_leader="nats-2", server_id="http://10.0.0.1:8222", server_name="nats-0", stream_leader="nats-0", stream_name="tasks"} 0
jetstream_stream_total_messages{account="$G", cluster="nats", instance="nats-1", is_meta_leader="false", is_stream_leader="false", job="nats", meta_leader="nats-2", server_id="http://10.0.0.2:8222", server_name="nats-1", stream_leader="nats-0", stream_name="tasks"} 17
jetstream_stream_total_messages{account="$G", cluster="nats", instance="nats-2", is_meta_leader="true", is_stream_leader="false", job="nats", meta_leader="nats-2", server_id="http://10.0.0.3:8222", server_name="nats-2",  stream_leader="nats-0", stream_name="tasks"} 0

Stream info:

Information for Stream tasks created 2022-04-22T12:59:25Z

Configuration:

             Subjects: tasks
     Acknowledgements: true
            Retention: File - WorkQueue
             Replicas: 3
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: unlimited
        Maximum Bytes: unlimited
          Maximum Age: unlimited
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


Cluster Information:

                 Name: nats
               Leader: nats-0
              Replica: nats-1, current, seen 0.69s ago
              Replica: nats-2, current, seen 0.69s ago

State:

             Messages: 0
                Bytes: 0 B
             FirstSeq: 13,531,033
              LastSeq: 13,531,032 @ 2022-08-30T09:05:14 UTC
     Active Consumers: 1

After several tries of nats stream cluster step-down tasks, once the node with the non-zero count becomes the leader, nats stream info tasks shows non-zero messages:

Information for Stream tasks created 2022-04-22T12:59:25Z

Configuration:

             Subjects: tasks
     Acknowledgements: true
            Retention: File - WorkQueue
             Replicas: 3
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: unlimited
        Maximum Bytes: unlimited
          Maximum Age: unlimited
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


Cluster Information:

                 Name: nats
               Leader: nats-1
              Replica: nats-0, current, seen 0.27s ago
              Replica: nats-2, current, seen 0.27s ago

State:

             Messages: 17
                Bytes: 27 KiB
             FirstSeq: 13,360,589 @ 2022-08-28T17:38:06 UTC
              LastSeq: 13,533,300 @ 2022-08-30T09:36:44 UTC
     Deleted Messages: 172695
     Active Consumers: 1

And these messages never get consumed, even after restarting the consumers, while all new messages are fine (e.g. the count grows to 18, the new message gets consumed, and the count is back to 17).

The only fix we have found is recreating the stream. Is there a simpler solution? And is this just a counter problem, or are these messages really never going to be consumed (i.e. lost)?

urusha avatar Aug 30 '22 10:08 urusha

@urusha that sounds very similar to what we are experiencing indeed. I don't have a solution, but a workaround we use is nats stream view tasks. If there are a bunch of messages and the nats viewer is paging results, just select No to cancel, and that clears out the messages.

PS: sometimes this command seems to leave a temporary consumer hanging on the stream, accumulating even more messages :) In that case just run nats consumer rm tasks (choosing the offending consumer) to delete the temporary consumer, and all is back to normal. Works for us, YMMV.

Kaarel avatar Aug 30 '22 11:08 Kaarel

Issue update. We are now on NATS server 2.9.1 and NATS CLI 0.0.34.

The CLI still reports 0 messages, while curl localhost:8222 reports a non-zero count.

nats stream report && nats stream info logs && nats consumer report logs && nats consumer info logs c1 && curl localhost:8222/jsz?consumers=1

Obtaining Stream stats

╭──────────────────────────────────────────────────────────────────────────────────────────────╮
│                             Stream Report                                                    │
├──────────┬─────────┬───────────┬───────────┬──────────┬───────┬──────┬─────────┬─────────────┤
│ Stream   │ Storage │ Placement │ Consumers │ Messages │ Bytes │ Lost │ Deleted │ Replicas    │
├──────────┼─────────┼───────────┼───────────┼──────────┼───────┼──────┼─────────┼─────────────┤
│ ...      │ File    │           │ 1         │ 0        │ 0 B   │ 0    │ 0       │ n1*, n2, n3 │
│ logs     │ File    │           │ 2         │ 0        │ 0 B   │ 0    │ 0       │ n1*, n2, n3 │
╰──────────┴─────────┴───────────┴───────────┴──────────┴───────┴──────┴─────────┴─────────────╯

Information for Stream logs created 2022-07-13 20:07:45

             Subjects: logs.>
             Replicas: 3
              Storage: File

Options:

            Retention: Interest
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: 1.0 GiB
          Maximum Age: 5d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


Cluster Information:

                 Name: nats-cluster
               Leader: n1
              Replica: n2, current, seen 0.10s ago
              Replica: n3, current, seen 0.10s ago

State:

             Messages: 0
                Bytes: 0 B
             FirstSeq: 90,650,977
              LastSeq: 90,650,976 @ 2022-09-27T07:53:29 UTC
     Active Consumers: 2

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                  Consumer report for logs with 2 consumers                                    │
├───────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬─────────────┬────────────┬─────────────┤
│ Consumer  │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed │ Ack Floor  │ Cluster     │
├───────────┼──────┼────────────┼──────────┼─────────────┼─────────────┼─────────────┼────────────┼─────────────┤
│ c2        │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 90,650,976 │ n1*, n2, n3 │
│ c1        │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 90,650,976 │ n1*, n2, n3 │
╰───────────┴──────┴────────────┴──────────┴─────────────┴─────────────┴─────────────┴────────────┴─────────────╯

Information for Consumer logs > c1 created 2022-09-23T13:21:10Z

Configuration:

        Durable Name: c1
    Delivery Subject: c1.logs
      Deliver Policy: All
 Deliver Queue Group: c1
          Ack Policy: Explicit
            Ack Wait: 30s
       Replay Policy: Instant
     Max Ack Pending: 1,000
        Flow Control: false

Cluster Information:

                Name: nats-cluster
              Leader: n1
             Replica: n2, current, seen 0.04s ago
             Replica: n3, current, seen 0.04s ago

State:

   Last Delivered Message: Consumer sequence: 90,655,130 Stream sequence: 90,650,977 Last delivery: 0.05s ago
     Acknowledgment floor: Consumer sequence: 90,655,130 Stream sequence: 90,650,977 Last Ack: 0.04s ago
         Outstanding Acks: 0 out of maximum 1,000
     Redelivered Messages: 0
     Unprocessed Messages: 0
          Active Interest: Active using Queue Group c1
{
  "server_id": "NCCJQ7RBYCMAZPOD5PWEQGJKREAK6SCZZYE36CAZGPTWGXCSR3LKCAM6",
  "now": "2022-09-27T07:53:30.817377233Z",
  "config": {
    "max_memory": 1073741824,
    "max_storage": 10737418240,
    "store_dir": "/natsdata/jetstream",
    "compress_ok": true
  },
  "memory": 0,
  "storage": 3319,
  "reserved_memory": 0,
  "reserved_storage": 5368709120,
  "accounts": 1,
  "ha_assets": 12,
  "api": {
    "total": 6,
    "errors": 0
  },
  "streams": 5,
  "consumers": 6,
  "messages": 4,
  "bytes": 3319,
  "meta_cluster": {
    "name": "nats-cluster",
    "leader": "n3",
    "peer": "PCIUZRPn",
    "cluster_size": 3
  },
  "account_details": [
    {
      "name": "USERS",
      "id": "USERS",
      "memory": 0,
      "storage": 5051,
      "reserved_memory": 18446744073709551615,
      "reserved_storage": 18446744073709551615,
      "accounts": 0,
      "ha_assets": 0,
      "api": {
        "total": 86,
        "errors": 2
      },
      "stream_detail": [
        ...,
        {
          "name": "logs",
          "cluster": {
            "name": "nats-cluster",
            "leader": "n1",
            "replicas": [
              {
                "name": "n1",
                "current": true,
                "active": 65687391,
                "lag": 56270029,
                "peer": "0XwpHHIx"
              },
              {
                "name": "n3",
                "current": false,
                "active": 57142604217790,
                "lag": 56270029,
                "peer": "PCIUZRPn"
              }
            ]
          },
          "state": {
            "messages": 4,
            "bytes": 3319,
            "first_seq": 89934011,
            "first_ts": "2022-09-26T16:00:57.667566887Z",
            "last_seq": 90650977,
            "last_ts": "2022-09-27T07:53:30.752516108Z",
            "num_subjects": 3,
            "num_deleted": 716963,
            "consumer_count": 2
          },
          "consumer_detail": [
            {
              "stream_name": "logs",
              "name": "c2",
              "created": "2022-09-23T13:30:51.348276829Z",
              "delivered": {
                "consumer_seq": 86706344,
                "stream_seq": 89907779,
                "last_active": "2022-09-27T07:53:30.752760301Z"
              },
              "ack_floor": {
                "consumer_seq": 86706344,
                "stream_seq": 86702758,
                "last_active": "2022-09-27T07:53:30.753759861Z"
              },
              "num_ack_pending": 0,
              "num_redelivered": 0,
              "num_waiting": 0,
              "num_pending": 710433,
              "cluster": {
                "name": "nats-cluster",
                "leader": "n1",
                "replicas": [
                  {
                    "name": "n1",
                    "current": true,
                    "active": 63731012,
                    "lag": 110882275,
                    "peer": "0XwpHHIx"
                  },
                  {
                    "name": "n3",
                    "current": false,
                    "active": 57142594877123,
                    "lag": 110882275,
                    "peer": "PCIUZRPn"
                  }
                ]
              },
              "push_bound": true
            },
            {
              "stream_name": "logs",
              "name": "c1",
              "created": "2022-09-23T13:30:51.34837326Z",
              "delivered": {
                "consumer_seq": 86707542,
                "stream_seq": 89934046,
                "last_active": "2022-09-27T07:53:30.752997658Z"
              },
              "ack_floor": {
                "consumer_seq": 86707542,
                "stream_seq": 86702767,
                "last_active": "2022-09-27T07:53:30.753864074Z"
              },
              "num_ack_pending": 0,
              "num_redelivered": 0,
              "num_waiting": 0,
              "num_pending": 710431,
              "cluster": {
                "name": "nats-cluster",
                "leader": "n1",
                "replicas": [
                  {
                    "name": "n1",
                    "current": true,
                    "active": 66541942,
                    "lag": 108397454,
                    "peer": "0XwpHHIx"
                  },
                  {
                    "name": "n3",
                    "current": false,
                    "active": 57150206277950,
                    "lag": 108397454,
                    "peer": "PCIUZRPn"
                  }
                ]
              },
              "push_bound": true
            }
          ]
        }
      ]
    }
  ]
}

Kaarel avatar Sep 27 '22 08:09 Kaarel
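One thing the jsz dump above makes visible is that each consumer's delivered stream_seq is far ahead of its ack_floor stream_seq while num_pending stays large. That gap can be surfaced mechanically; a sketch (not from the thread; field names follow the consumer_detail entries shown above, and the zero threshold is an arbitrary assumption):

```python
def consumer_health(consumer: dict) -> dict:
    """Summarize a jsz consumer_detail entry: ack gap and backlog."""
    delivered = consumer["delivered"]["stream_seq"]
    ack_floor = consumer["ack_floor"]["stream_seq"]
    return {
        "name": consumer["name"],
        "ack_gap": delivered - ack_floor,    # delivered but not yet acked past
        "pending": consumer["num_pending"],  # messages the server still counts as unprocessed
    }

def flag_stuck(consumers: list, max_gap: int = 0) -> list:
    """Return consumers whose ack floor lags delivery by more than max_gap."""
    return [h for h in map(consumer_health, consumers) if h["ack_gap"] > max_gap]
```

Applied to the c2 entry above (delivered stream_seq 89,907,779 vs ack_floor 86,702,758), the gap is over three million sequences even though num_ack_pending is 0, which matches the inconsistency being described.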

@Kaarel I believe that the jsz output makes sense only when inspecting the leader (for a given asset). @ripienaar can correct me if I am wrong.

kozlovic avatar Sep 27 '22 15:09 kozlovic

Correct, only the leader can be trusted

ripienaar avatar Sep 27 '22 15:09 ripienaar
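Given that answer, monitoring code probably should not aggregate jsz stream state from every node. A sketch of leader-only filtering (the helper is hypothetical, not part of any NATS tooling; the JSON shape follows the jsz output earlier in the thread):

```python
def trusted_stream_states(server_name: str, jsz: dict) -> dict:
    """From one node's /jsz?streams=true output, keep only the streams this
    node currently leads, since non-leader replicas may report stale state."""
    trusted = {}
    for account in jsz.get("account_details", []):
        for stream in account.get("stream_detail", []):
            if stream.get("cluster", {}).get("leader") == server_name:
                trusted[stream["name"]] = stream["state"]
    return trusted
```

Under this rule, the "messages": 4 reported by a non-leader in the dump above would simply be dropped rather than exported as a metric.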