
Nats streaming is not reliable in failure states

Open ssoroka opened this issue 5 years ago • 5 comments

Scenario:

Running a cluster with 3 nodes. 1 node dies unexpectedly mid-write and refuses to come back up, logging errors that suggest setting ContinueOnUnexpectedEOF in order to reload. Some investigation shows this isn't as straightforward as expected, as it's unclear whether that is the right thing to do when running in a cluster. The expectation here is that the cluster can restart the node and figure out where to pick up, but it does not.

Action taken:

The investigating tech removes the data folder, based loosely on advice from https://github.com/nats-io/nats-streaming-server/#file-store-recovery-errors. It's unclear whether this removed the raft data as well, as we didn't see where that was stored, and it's unclear whether this caused data loss. The expectation here is that the nats-streaming cluster should be able to automatically recover the missing data.

Result:

Servers come back up, but the cluster does not recover cleanly. The next nats-streaming server errors with [ERR] STREAM: Received invalid client publish message, likely because the client data the service keeps is now incomplete or out of sync.

Had to restart all the clients connecting to nats-streaming so they would generate new client IDs, then restart the servers again.

Note: This is all pretty disturbing and I suspect we're dropping messages here like crazy (will take some time to manually confirm this). I doubt I'm the only person seeing this: https://github.com/nats-io/nats-streaming-server/issues/750

I'm thinking there are some underlying challenges with the architecture, due to the fact that this is a streaming service built on top of another pub-sub service, and the two seem to have different ideas about how to recover from failure.

I'm not really expecting a resolution to this issue, as I think we've exhausted our tolerance for nats as a whole, but I'd love to hear if you have any recommendations or plans to better address failure scenarios.

ssoroka avatar Feb 22 '19 16:02 ssoroka

@ssoroka Sorry for the delay and your bad experience with NATS Streaming. What file was reported with the bad EOF? Was it a message file? If so, what I probably would have done in this context is remove all msgs.x.dat and .idx files from the directory reporting issues, then restart the node (repeating for any other corrupted channels). Since you have 3 nodes, restarting the failed node should have worked: all messages in the raft log on that failed node would have been applied (replayed), so msgs.x.dat/idx would have been recreated, and any state still missing there should have been recovered thanks to the leader. Even if a snapshot had occurred, the system would be able to restore the data.

kozlovic avatar Feb 25 '19 23:02 kozlovic

Hey @kozlovic. Thanks for responding. Yes, it was a msgs.x.dat file (I just checked the Kibana logs). For whatever reason, restarting the node didn't resolve the issue; it would restart into the same error condition. I suspect removing the files you mentioned and then restarting would resolve it, but it wasn't clear from any docs I could find whether removing some or all of the files would result in dropped messages. Ideally, the system would detect the cluster setup and recover automatically.

ssoroka avatar Feb 26 '19 02:02 ssoroka

Thanks for the feedback. Yes, it would be nice if in clustering mode a different approach were taken automatically (the doc you referred to is about file store "corruption" and how to restart a server in that condition, but more in a standalone/FT fashion than in clustering mode). I will keep this issue open in case I come up with a solution for that. Thanks again!

kozlovic avatar Feb 26 '19 21:02 kozlovic

Is this challenge any different with the JetStream approach? I see this issue has not been resolved in a permanent, solid fashion, and we experience it with STAN v0.24.1.

LarsBingBong avatar Mar 25 '22 10:03 LarsBingBong

@LarsBingBong JetStream should have a better repair mechanism. Also, NATS Streaming has a deprecation notice, and the big development efforts are going into JetStream, not NATS Streaming. Of course, as you know since you use v0.24.1 (not the latest, but close), we still make some updates to Streaming when needed.

kozlovic avatar Mar 30 '22 20:03 kozlovic