nats-server
JetStream stream lost when processes are killed and restarted [v2.10.20]
Observed behavior
In version 2.10.20, it looks as if a handful of process crashes can cause NATS JetStream to forget that a stream ever existed. I've reproduced this with both three- and five-node clusters, with replication factors 3 and 5. This occurs even with sync_interval=always, as well as with the default two-minute sync interval.
This test creates a single JetStream stream called jepsen-stream and publishes a series of unique values to a single subject (jepsen.0) within it.
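For reference, the workload above can be sketched with the jnats client roughly as follows. This is an illustrative sketch, not the test's actual code (which is Clojure, linked below): the connection URL, message count, and class name are assumptions, and it needs a running JetStream-enabled cluster to execute.

```java
import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.StorageType;
import io.nats.client.api.StreamConfiguration;

import java.nio.charset.StandardCharsets;

public class JepsenStreamSketch {
    public static void main(String[] args) throws Exception {
        // Connect to one cluster node (URL is an assumption for illustration).
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            // Create a file-backed, replication-factor-3 stream on the test subject.
            jsm.addStream(StreamConfiguration.builder()
                    .name("jepsen-stream")
                    .subjects("jepsen.0")
                    .storageType(StorageType.File)
                    .replicas(3)
                    .build());

            // Publish a series of unique values; each publish waits for a JetStream ack.
            JetStream js = nc.jetStream();
            for (int i = 0; i < 10; i++) {
                js.publish("jepsen.0", Integer.toString(i).getBytes(StandardCharsets.UTF_8));
            }

            // In the failure mode described here, after kill -9 / restart cycles this
            // returns an empty list instead of ["jepsen-stream"].
            System.out.println(jsm.getStreamNames());
        }
    }
}
```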
After killing a few nats-server processes with kill -9, attempts to publish messages fail with 503 No Responders Available For Request, and attempts to subscribe to the subject fail with Can't subscribe, [SUB-90007] No matching streams for subject. This persists even after we restart every node and stop killing them. Calling JetStreamManager.getStreamNames() returns an empty list rather than ["jepsen-stream"]. This state of affairs seems to last indefinitely--here's a test where we waited 10,000 seconds for recovery, and the stream never came back.
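The same symptom should also be visible from the nats CLI, if anyone wants to poke at a cluster in this state by hand. A sketch, assuming a local server URL (the exact error text may differ by CLI version), which needs a live cluster to run:

```shell
# List streams known to the cluster; in the failure mode this comes back empty.
nats --server nats://localhost:4222 stream ls

# Ask for the stream directly; in the failure mode this reports the stream
# as not found rather than showing its config and state.
nats --server nats://localhost:4222 stream info jepsen-stream
```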
You'll find node logs here--nothing obvious is jumping out at me. 20250509T191519.377-0500.zip.
I wanted to check--is this... expected behavior? Am I perhaps holding NATS wrong somehow? You can find the NATS Java code I'm calling here: https://github.com/jepsen-io/nats/blob/9e52d9cf0c5f94d436efbfef9e2f2e1288ad7b0f/src/jepsen/nats/client.clj#L78-L136.
Expected behavior
JetStream streams should not vanish permanently? The whole point of JetStream is that streams are supposed to be persistent, right?
Server and client version
Server: 2.10.20
Client: io.nats/jnats "2.21.1"
Host environment
These nodes are running Debian 12, in 3- or 5-node clusters under LXC.
Steps to reproduce
You can reproduce this by cloning the test suite linked above at commit 9e52d9, setting up a Jepsen environment, and running lein run test --rate 100 --time-limit 300 --nemesis kill --test-count 10 --sync-interval always. The issue usually manifests within a few minutes.
I wanted to check--is this... expected behavior?
This definitely must not happen. If the stream create is successful, then at least a quorum of servers is supposed to know about it, and the stream should remain until you explicitly send a delete request.
This sounds like a familiar bug to me, and I think it was fixed some time ago. I recall reproducing it by creating the R3 stream and killing all the servers shortly afterward. So definitely really bad, and unexpected from the user's POV.
I'm currently on holiday (on a train, on my phone, hehe) so I can't fully inspect and run the provided code. But I highly recommend upgrading to the latest 2.10.x or 2.11.x version. At the moment the latest versions are 2.10.29 and 2.11.3.
From 2.10.23 onward, loads of Raft- and replication-related bugs were squashed (many replication bugs were fixed for the 2.11 series, and we backported those fixes to 2.10 starting with that version). We have also been using a tool called Antithesis for quite some time to reproduce and fix various issues.
My expectation is that it should be fixed on the later versions. If not, I'd be happy to dive into this myself once back from holiday (if other maintainers don't get to it first of course).
Thank you for the report!
As a point of reference, we ran these tests for a few hours using the Jepsen docker compose environment. In our environments the test passes ("Everything looks good! ヽ(‘ー`)ノ") on 2.10.20, as well as on 2.10.29 and 2.11.3. In general, ':no-responders' errors can be expected during leader elections in runs that hard-kill servers, and we do see them in our local runs.
We don't see "Can't subscribe, [SUB-90007] No matching streams for subject." I noticed that it happens in the "final generator" phase in the original logs.
We ran the test against each version for 30-60 minutes. Attaching one of the 2.10.20 outputs.
jepsen-nats-2.10.20-20250513T031153.555Z.tar.gz
We are interested in your results with later versions in your environment.
@aphyr Have you had the chance to confirm with a more recent version? 2.11.3 is the most recent GA version, also just in the process of releasing 2.11.4-RC.2. Let me know if you'd like to chat more on it.
Ah, sorry, I had to move back to paid contracting--haven't had a chance to look at this since!
No problem, we'll keep looking into this on our side but do let us know if you pick this up again, we'll be happy to help.
Hey y'all! I finally have time to go back and look at this. It looks to me like it was extant in 2.10.20 and 2.10.22, then resolved in 2.10.23. It looks fairly catastrophic to me: simple process crashes have a decent chance to cause total data loss, and I can generally reproduce it in just a few minutes. Here's another test run, on 2.10.22:
I've been looking at the changelog for 2.10.23 and nothing is really jumping out as a "by the way NATS could lose all your data" sort of bug. Is it possible that it was #6061?
2.10 is no longer supported. Please test in latest 2.11 or 2.12 release.
This was indeed one of the many bugs that were fixed while we were working on 2.11. I had reproduced this bug a little over a year ago by creating the replicated stream and shortly after hard killing all the servers (within ~1 minute of creating it): https://github.com/nats-io/nats-server/pull/5946. That wasn't the proper fix though, and https://github.com/nats-io/nats-server/pull/5700 ended up being the correct fix which later got pulled into 2.10.23 with many other Raft-related fixes.
Like @derekcollison mentions, the 2.10.x series is no longer supported. Many bugs have been fixed since, and we'd highly recommend upgrading to the latest 2.11.10 or 2.12.1 versions (with 2.11.11 and 2.12.2 to be released early next week).
Excellent--and yes, I know this is an old version, I just wanted to close out the results I started with. :-)