nats-server
nats-server copied to clipboard
Orphan messages in the stream
Observed behavior
I started a discussion on the Slack channel but opened an issue, too.
I am using NATS 2.10.12-alpine3.19. I have a stream like this below,
Information for Stream tfy-agent-state-buffer created 2024-05-20 14:19:40
Subjects: tfy-agent-state-buffer.>
Replicas: 3
Storage: File
Options:
Retention: WorkQueue
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 15m0s
Direct Get: true
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: true
Limits:
Maximum Messages: unlimited
Maximum Per Subject: 20
Maximum Bytes: unlimited
Maximum Age: 1d0h0m0s
Maximum Message Size: 50 KiB
Maximum Consumers: unlimited
Cluster Information:
Name: nats
Leader: truefoundry-nats-2
Replica: truefoundry-nats-0, current, seen 445ms ago
Replica: truefoundry-nats-1, current, seen 445ms ago
State:
Messages: 1,891
Bytes: 956 KiB
First Sequence: 9,780,767 @ 2024-06-15 00:49:51 UTC
Last Sequence: 9,813,689 @ 2024-06-15 08:20:20 UTC
Deleted Messages: 31,032
Active Consumers: 173
Number of Subjects: 1,226
As you can see, 1891 messages are in the stream. If I look at filter subject tfy-agent-state-buffer.tfy-devtest-euwe1.>
❯ nats -s nats://localhost:4222 --creds prod-nats.creds stream subjects tfy-agent-state-buffer | grep tfy-devtest-euwe1
│ tfy-agent-state-buffer.tfy-devtest-euwe1.application.active-version.business-ws.customer-supslack-bot │ 1 │
│ tfy-agent-state-buffer.tfy-devtest-euwe1.job-run.status.truefoundcivo-usage-tracker-1718399700 │ 1 │
...
But the consumer for the above filter subject shows no pending messages.
❯ nats -s nats://localhost:4222 --creds prod-nats.creds consumer info tfy-agent-state-buffer cluster-tfy-devtest-euwe1
Information for Consumer tfy-agent-state-buffer > cluster-tfy-devtest-euwe1 created 2024-06-14T18:16:33+05:30
Configuration:
Name: cluster-tfy-devtest-euwe1
Pull Mode: true
Filter Subject: tfy-agent-state-buffer.tfy-devtest-euwe1.>
Deliver Policy: All
Ack Policy: Explicit
Ack Wait: 30.00s
Replay Policy: Instant
Maximum Deliveries: 2,880
Max Ack Pending: 1,000
Max Waiting Pulls: 512
Cluster Information:
Name: nats
Leader: truefoundry-nats-2
Replica: truefoundry-nats-0, current, seen 667ms ago
Replica: truefoundry-nats-1, current, seen 667ms ago
State:
Last Delivered Message: Consumer sequence: 132,187 Stream sequence: 9,813,760 Last delivery: 3m35s ago
Acknowledgment Floor: Consumer sequence: 132,187 Stream sequence: 9,813,698 Last Ack: 3m35s ago
Outstanding Acks: 0 out of maximum 1,000
Redelivered Messages: 0
Unprocessed Messages: 0
Waiting Pulls: 1 of maximum 512
- All these commands were executed by directly connecting to the stream and the consumer leader.
- We do not execute
NakorTermin our system on messages. - Even with the
consumer reportcommand, I do not see any pending or unprocessed messages on any consumer. - We use the subject rollups and dedup feature while publishing messages in the stream.
Our consumption flow looks something like this,
- We create or update the consumer for a filter subject.
a. We fetch max 50 messages from the consumer using
Fetch. b. We process these 50 messages serially and in order. c. After processing each message, we executeDoubleAck. We do not process messageNifN - 1is not double-acked successfully. d. Go back to (a) after the batch is processed completely.
In a different environment, we are running nats:2.10.14-alpine3.19
I am noticing the same issue there, too.
❯ nats -s ... --creds devtest-nats.creds stream get --last-for "tfy-agent-state-buffer.tfy-willk-usce.>"
? Select a Stream tfy-agent-state-buffer
Item: tfy-agent-state-buffer#990719 received 2024-06-15 00:38:11.357294174 +0000 UTC on Subject tfy-agent-state-buffer.tfy-willk-usce.application.active-version.tfy-agent.tfy-agent-cp
Headers:
Nats-Rollup: sub
Nats-Msg-Id: 343a147c7869f6291d2759731f4eb666c15ff76c
...
❯ nats -s ... --creds devtest-nats.creds consumer info tfy-agent-state-buffer cluster-tfy-willk-usce
Information for Consumer tfy-agent-state-buffer > cluster-tfy-willk-usce created 2024-06-05T16:46:08+05:30
Configuration:
Name: cluster-tfy-willk-usce
Pull Mode: true
Filter Subject: tfy-agent-state-buffer.tfy-willk-usce.>
Deliver Policy: All
Ack Policy: Explicit
Ack Wait: 30.00s
Replay Policy: Instant
Maximum Deliveries: 2,880
Max Ack Pending: 1,000
Max Waiting Pulls: 512
Cluster Information:
Name: nats
Leader: nats-1
Replica: nats-0, current, seen 699ms ago
Replica: nats-2, current, seen 699ms ago
State:
Last Delivered Message: Consumer sequence: 3,517 Stream sequence: 998,846 Last delivery: 58.14s ago
Acknowledgment Floor: Consumer sequence: 3,517 Stream sequence: 998,846 Last Ack: 58.13s ago
Outstanding Acks: 0 out of maximum 1,000
Redelivered Messages: 0
Unprocessed Messages: 0
Waiting Pulls: 1 of maximum 512
❯ nats -s ... --creds devtest-nats.creds stream get --last-for "tfy-agent-state-buffer.tfy-willk-usce.>"
? Select a Stream tfy-agent-state-buffer
Item: tfy-agent-state-buffer#990719 received 2024-06-15 00:38:11.357294174 +0000 UTC on Subject tfy-agent-state-buffer.tfy-willk-usce.application.active-version.tfy-agent.tfy-agent-cp
Headers:
Nats-Rollup: sub
Nats-Msg-Id: 343a147c7869f6291d2759731f4eb666c15ff76c
Expected behavior
The stream will not have any orphan messages.
Server and client version
The server version is mentioned above.
CLI
❯ nats --version
0.1.4
On the consumption side, we use the Go client v1.34.1.
On the publisher side, we use "nats.ws": "^1.21.0".
Host environment
No response
Steps to reproduce
No response
We found this issue while debugging a potential orphan message scenario. We were not receiving the messages in the stream from the interested consumer. Even though there were messages in the stream, the consumer was not showing any outstanding acks or unprocessed messages.
On the Slack thread, @ripienaar suggested that there can be some edge cases due to using the rollup feature in a work queue stream, but I have not verified this yet.
We use rollups to replace existing messages from the stream for some subjects if they have not been processed yet.
We could not model this using the per-subject message limit as, in some cases, I need to keep multiple messages with the same subject.
Sometimes, we want a message of a particular subject to replace any existing unprocessed messages in the stream with the same subject.
We decided to use rollups to model this.
We have made some fixes in the area of stream sync. I would suggest trying 2.10.17-RC3 (formal release will be next week).
https://github.com/nats-io/nats-server/releases
Cool. We will try it once the release is done on our devtest environment.
On 2.10.17 either ordering is broken or orphans even when rollup is used.
@sammy007 please provide details and insights that lead you to this conclusion.
We are running 2.10.17 and have not noticed this issue again yet. We noticed that 2.10.18 also had related fixes and upgraded on one of our environments. I will close this issue.