nats-server
JetStream Stream drift and failure to recover consistency observed in a R3 setup backed by file storage
### Observed behavior
As part of testing system failure and recovery on a local 3-blade server running a NATS R3 cluster in k8s, we have somehow ended up in an unexpectedly inconsistent state that NATS seems incapable of recovering from.
We first observed the issue when consuming messages from an R3 stream: different messages were returned depending on which node the consumer was created on. We're unsure what exactly caused it, but it was somehow triggered by intentionally pulling the power cables from all servers and booting them back up.
### Possible assumptions / Wild Guesses
- The stream state is only verified against the meta stream state. This can cause the actual stream data to be completely out of sync. It seems to me that NATS is missing data integrity checks on the actual stream data when recovering state from it.
- The stream disk state is only verified based on `Last Sequence` and fails to recover streams that have a mismatch on `First Sequence`. There is a `var errFirstSequenceMismatch = errors.New("first sequence mismatch")` in the nats-server code base but no place where it could be returned. This could indicate that at some point there was a check for it, and that maybe it should be put back in place.
- I understand that RAFT only cares about the log index, which I assume translates to `Last Sequence`, but from what I understand of NATS's RAFT implementation, it does not follow the Leader Append-Only property because it actually removes old messages from a stream. I would think that removing messages would not be related to the RAFT log at all, but I have reason to believe that's not the case, according to the NATS JetStream Clustering docs:

  > NATS server optimizes this by combining the data plane for replicating messages with the messages RAFT would normally use to ensure consensus
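To make the second guess concrete, here is a minimal sketch of the kind of recovery-time check that would have caught the drift we observed. None of this is actual nats-server code: the `StreamState` struct and `checkAgainstLeader` helper are invented for illustration; only the error text mirrors the unused `errFirstSequenceMismatch` variable found in the code base.

```go
package main

import (
	"errors"
	"fmt"
)

// Mirrors the unused error found in the nats-server code base;
// everything else in this sketch is hypothetical.
var errFirstSequenceMismatch = errors.New("first sequence mismatch")

// StreamState is a made-up, minimal view of a replica's stream state.
type StreamState struct {
	FirstSeq uint64
	LastSeq  uint64
	Msgs     uint64
}

// checkAgainstLeader sketches a recovery-time check: comparing only
// LastSeq (what appears to happen today) misses the drift, while also
// comparing FirstSeq would surface it.
func checkAgainstLeader(local, leader StreamState) error {
	if local.LastSeq != leader.LastSeq {
		return errors.New("last sequence mismatch")
	}
	if local.FirstSeq != leader.FirstSeq {
		return errFirstSequenceMismatch
	}
	return nil
}

func main() {
	// Numbers taken from the stream info dumps in this report.
	node0 := StreamState{FirstSeq: 1027, LastSeq: 1029, Msgs: 3}
	node2 := StreamState{FirstSeq: 778, LastSeq: 1029, Msgs: 7}
	fmt.Println(checkAgainstLeader(node0, node2)) // first sequence mismatch
}
```

With the states from our cluster, a `Last Sequence`-only check passes, while the first-sequence comparison immediately flags the inconsistency.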
I'm happy to run more tests and provide whatever extra info is needed. After trying to look into this myself, we thought it would make sense to have the actual NATS maintainers shine some light on this instead of doing further guesswork.
### Expected behavior
NATS would be able to recover its state consistency across all replicas.
### Server and client version
Observed in:
- 2.10.16
- 2.10.17-RC.6
### Host environment
The NATS servers are configured with 10GB of file storage and no in-memory storage. The underlying volumes are provisioned with OpenEBS LocalPV on top of a multi-disk ZFS pool of SSD storage.
### Steps to reproduce
I have tried reproducing it locally without any success, and have not tried reproducing it again in our test cluster so as not to lose the state we have ended up in. I'm mostly convinced we ran into the same issue on a production cluster before, but there were too many variables to pinpoint what caused it at the time.
Here's a repo containing the minimum state and all configurations I used that match our cluster setup. It's not a reproduction of what caused the root issue, but rather a snapshot of the broken state. I dumped all the state files from the server and re-created the cluster locally using docker-compose after removing all the other streams and consumers. After restarting nodes multiple times I could still observe the state drift and no error logs, which indicates to me that NATS is unaware of it.
### Further investigation
Having no helpful logs and not enough knowledge of how to debug this kind of issue using the CLI, I decided to look at the files that make up the JetStream storage.
I focused on a stream named `system-config`. After finding it under the correct account directory, there were 2 files: `<index>.blk` and `index.db`. `nats-0` and `nats-1` contained a `258.blk` and `nats-2` contained a `1.blk`. There was an obvious correlation between that and the messages I was getting from the stream: I always got the same results from `nats-0` and `nats-1`, which differed from `nats-2`. The stream data in `nats-2` looked correct with 7 messages, which matched the number of subjects with a max of 1 message per subject, whereas the other 2 nodes only contained the last 3 messages of the stream. So basically there is a match in `Last Sequence`, but the number of messages, the number of subjects, and `First Sequence` don't match.
<details><summary>Stream info from node-0</summary>

```
Information for Stream system-config created 2024-06-13 14:49:29

Subjects: config.system.>
Replicas: 3
Storage: File

Options:
  Retention: Limits
  Acknowledgments: true
  Discard Policy: Old
  Duplicate Window: 2m0s
  Allows Msg Delete: true
  Allows Purge: true
  Allows Rollups: false

Limits:
  Maximum Messages: unlimited
  Maximum Per Subject: 1
  Maximum Bytes: unlimited
  Maximum Age: unlimited
  Maximum Message Size: unlimited
  Maximum Consumers: unlimited

Cluster Information:
  Name: nats
  Leader: nats-0
  Replica: nats-1, current, seen 489ms ago
  Replica: nats-2, current, seen 490ms ago

State:
  Messages: 3
  Bytes: 217 B
  First Sequence: 1,027 @ 2024-06-19 16:14:50 UTC
  Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
  Active Consumers: 0
  Number of Subjects: 3
```

</details>
<details><summary>Stream info from node-1</summary>

```
Information for Stream system-config created 2024-06-13 14:49:29

Subjects: config.system.>
Replicas: 3
Storage: File

Options:
  Retention: Limits
  Acknowledgments: true
  Discard Policy: Old
  Duplicate Window: 2m0s
  Allows Msg Delete: true
  Allows Purge: true
  Allows Rollups: false

Limits:
  Maximum Messages: unlimited
  Maximum Per Subject: 1
  Maximum Bytes: unlimited
  Maximum Age: unlimited
  Maximum Message Size: unlimited
  Maximum Consumers: unlimited

Cluster Information:
  Name: nats
  Leader: nats-1
  Replica: nats-0, current, seen 491ms ago
  Replica: nats-2, current, seen 490ms ago

State:
  Messages: 3
  Bytes: 217 B
  First Sequence: 1,027 @ 2024-06-19 16:14:50 UTC
  Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
  Active Consumers: 0
  Number of Subjects: 3
```

</details>
<details><summary>Stream info from node-2</summary>

```
Information for Stream system-config created 2024-06-13 14:49:29

Subjects: config.system.>
Replicas: 3
Storage: File

Options:
  Retention: Limits
  Acknowledgments: true
  Discard Policy: Old
  Duplicate Window: 2m0s
  Allows Msg Delete: true
  Allows Purge: true
  Allows Rollups: false

Limits:
  Maximum Messages: unlimited
  Maximum Per Subject: 1
  Maximum Bytes: unlimited
  Maximum Age: unlimited
  Maximum Message Size: unlimited
  Maximum Consumers: unlimited

Cluster Information:
  Name: nats
  Leader: nats-2
  Replica: nats-0, current, seen 461ms ago
  Replica: nats-1, current, seen 461ms ago

State:
  Messages: 7
  Bytes: 1.1 KiB
  First Sequence: 778 @ 2024-06-19 17:19:31 UTC
  Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
  Deleted Messages: 245
  Active Consumers: 0
  Number of Subjects: 7
```

</details>
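Putting the three dumps side by side, only `Last Sequence` agrees across the replicas. Transcribing the reported numbers into a quick comparison (the `replicaState` type and `allAgree` helper are mine, not nats-server code) makes the divergence explicit:

```go
package main

import "fmt"

// replicaState is a hand-transcribed summary of the stream info dumps.
type replicaState struct {
	node              string
	firstSeq, lastSeq uint64
	msgs, numSubjects uint64
}

// allAgree reports whether the selected field has the same value on
// every replica.
func allAgree(states []replicaState, get func(replicaState) uint64) bool {
	for _, s := range states[1:] {
		if get(s) != get(states[0]) {
			return false
		}
	}
	return true
}

func main() {
	states := []replicaState{
		{"nats-0", 1027, 1029, 3, 3},
		{"nats-1", 1027, 1029, 3, 3},
		{"nats-2", 778, 1029, 7, 7},
	}
	fmt.Println("last sequence agrees:", allAgree(states, func(s replicaState) uint64 { return s.lastSeq }))    // true
	fmt.Println("first sequence agrees:", allAgree(states, func(s replicaState) uint64 { return s.firstSeq })) // false
	fmt.Println("message count agrees:", allAgree(states, func(s replicaState) uint64 { return s.msgs }))      // false
}
```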
Here are a couple of observations I made by intentionally manipulating the stream state data.
If we run the following in the repro repo:
- `docker-compose up -d` to start the cluster
- `nats stream cluster step-down system-config` until you get `nats-2` as the leader
- `docker-compose stop nats-1`
- `sudo rm -rf data/nats-1/jetstream/SYS/_js_/* data/nats-1/jetstream/USERS/streams/system-config`
- `docker-compose up nats-1`

The state is correctly restored from the leader (`nats-2`).
But if instead we only remove the stream data, leaving the meta stream data in place, with `sudo rm data/nats-1/jetstream/USERS/streams/system-config`, then `nats-1` comes back with a stream containing no data at all while reporting it as current.
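This experiment suggests a follower can report itself current based on the meta/RAFT state alone, without validating its local store. A hypothetical guard before reporting current might look like the following sketch (all types and names here are invented for illustration; they are not nats-server APIs):

```go
package main

import (
	"errors"
	"fmt"
)

// localStore is an invented, minimal view of a replica's on-disk state.
type localStore struct {
	firstSeq, lastSeq uint64
}

// leaderSnapshot is an invented view of the sequences the leader reports.
type leaderSnapshot struct {
	firstSeq, lastSeq uint64
}

var errStoreBehindLeader = errors.New("local store inconsistent with leader snapshot")

// currentAgainst returns an error when the local file store does not
// line up with the leader's snapshot, instead of silently letting the
// replica report itself as current.
func (s localStore) currentAgainst(l leaderSnapshot) error {
	if s.firstSeq != l.firstSeq || s.lastSeq != l.lastSeq {
		return errStoreBehindLeader
	}
	return nil
}

func main() {
	// nats-1 after wiping its stream data: an empty store (0, 0),
	// while the leader still reports sequences 778..1029.
	empty := localStore{0, 0}
	leader := leaderSnapshot{778, 1029}
	fmt.Println(empty.currentAgainst(leader)) // local store inconsistent with leader snapshot
}
```

A check along these lines would have made `nats-1` report itself as outdated and re-sync, instead of serving an empty stream.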
```
Information for Stream system-config created 2024-06-13 14:49:29

Subjects: config.system.>
Replicas: 3
Storage: File

Options:
  Retention: Limits
  Acknowledgments: true
  Discard Policy: Old
  Duplicate Window: 2m0s
  Allows Msg Delete: true
  Allows Purge: true
  Allows Rollups: false

Limits:
  Maximum Messages: unlimited
  Maximum Per Subject: 1
  Maximum Bytes: unlimited
  Maximum Age: unlimited
  Maximum Message Size: unlimited
  Maximum Consumers: unlimited

Cluster Information:
  Name: nats
  Leader: nats-1
  Replica: nats-0, current, seen 491ms ago
  Replica: nats-2, current, seen 491ms ago

State:
  Messages: 0
  Bytes: 0 B
  First Sequence: 0
  Last Sequence: 0
  Active Consumers: 0
```