nats-server
JetStream - R1 file consumer loses offset after server restart
Defect
Make sure that these boxes are checked before submitting your issue -- thank you!

- [ ] Included `nats-server -DV` output
- [x] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)
Versions of `nats-server` and affected client libraries used:

- `nats-server`: v2.8.4
- NATS CLI: v0.0.33

OS/Container environment: macOS Monterey
Steps or code to reproduce the issue:
- Create a 3-node JetStream cluster.
- Create a stream and an R1 consumer.
- Publish some messages and start the consumer.
- Stop the leader node for the above R1 consumer.
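A minimal reproduction sketch along those lines (server names, ports, subjects, and stream/consumer names are illustrative, not taken from the report, and CLI flags may differ slightly across versions):

```shell
# Start a 3-node JetStream cluster (illustrative ports and store dirs).
nats-server -js -n n1 -p 4222 -cluster nats://localhost:6222 -cluster_name C \
  -routes nats://localhost:6222,nats://localhost:6223,nats://localhost:6224 -sd /tmp/n1 &
nats-server -js -n n2 -p 4223 -cluster nats://localhost:6223 -cluster_name C \
  -routes nats://localhost:6222,nats://localhost:6223,nats://localhost:6224 -sd /tmp/n2 &
nats-server -js -n n3 -p 4224 -cluster nats://localhost:6224 -cluster_name C \
  -routes nats://localhost:6222,nats://localhost:6223,nats://localhost:6224 -sd /tmp/n3 &

# R3 file-backed stream and an R1 durable pull consumer.
nats stream add ORDERS --subjects "orders.>" --storage file --replicas 3 --defaults
nats consumer add ORDERS C1 --pull --deliver all --replicas 1 --defaults

# Publish and consume a few messages.
nats pub orders.new "hello" --count 100
nats consumer next ORDERS C1 --count 10

# Stop the node reported as the consumer's leader, then inspect the consumer.
nats consumer info ORDERS C1
```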
Expected result:
The consumer should get re-assigned to a different node and should resume working once it (the consumer) is restarted.
Actual result:
Consumer re-assignment sometimes fails, even with a graceful shutdown. The consumer data gets migrated to a different node, but `nats consumer info`
shows a different (hashed) name for the consumer, and selecting that consumer reports "consumer not found".
Proposal (added another point):
- Re-assignment should consider multiple possible peers instead of just one - here
- There should be retry logic for failed re-assignments - here. If an assignment to one peer fails, a different peer should be tried.
- If nothing else, at least ensure that an R1 file consumer can restart from where it left off before the restart. Being off by one or two messages is acceptable, but losing the offset completely is not.
EDIT:
Restarting the leader node (after some downtime) of an R1 consumer doesn't necessarily recover the consumer: killing the node will attempt to migrate the consumer, and if that migration fails, the consumer is lost.
@goku321 What's your expectation for an R2 consumer in an R5 stream where both consumer nodes are shut down?
In step 2 you write "Create a stream and a R1 consumer". That is a durable consumer? Also mind sharing your stream/consumer config?
Hi @matthiashanel, I work with @goku321 .
@goku321 What's your expectation for an R2 consumer in an R5 stream where both consumer nodes are shut down?
We use an R1 consumer. We are looking for a DR strategy for when the node persisting the R1 consumer is lost forever. Latency is okay, but losing the offset is not. Every message triggers a lot of compute, so re-processing many messages is an issue.
In step 2 you write "Create a stream and a R1 consumer". That is a durable consumer?
Yes.
Also mind sharing your stream/consumer config?
Stream is R3 File, consumer is R1 File.
@shiv4289, but losing the offset is basically what it would mean if we were to automatically migrate the R1 consumer once its machine disappears.
Expected result: The consumer should get re-assigned to a different node and should work just fine upon restarting (the consumer).
Since you write:
Latency is okay but losing the offset is not
wouldn't it be okay for you to make the durable consumer R3 then? With R=1 you have one copy of the offsets etc.; with R=3 you'd have three.
Hi @matthiashanel, we started with R3 consumers, but we are designing for 1 million consumers and 1000 R3 streams. 1 million R3 consumers do not work according to our JetStream benchmarks. In our use case the message throughput is low, and we can tolerate latencies of up to a couple of minutes.
We tried a 7-node cluster of c5.4xlarge EC2 machines. Recovery takes what looks like forever whenever we restart a machine, rendering JetStream unavailable. We prefer durable consumers because the consuming apps stay on customers' premises. That left us with only R1. Happy to try if you have more ideas :-)
@matthiashanel I've added an edit to the original post. We were under the impression that if we restart the leader node of an R1 consumer, the consumer will still be there and can resume from where it left off. But that was not the case when we tested this scenario. Consumer migration can fail when the node is taken down, which makes R1 consumers not very useful for us. For our use case, we would be happy if we could recover an R1 consumer when the leader node comes back online, or if there were a way to make sure the migration succeeds consistently.
We are actively working on making this work correctly, such that R1 assets, whether streams or consumers, can be successfully migrated.
@matthiashanel @derekcollison Hi, we have commented out the consumer-migration code that runs on server shutdown and tested the same. We are using this modified server binary and it behaves predictably:
- When the NATS server persisting the R1 consumer is down, the consumer hangs.
- As soon as the server comes back, the consumer starts consuming again.
Would you like us to submit a PR here removing migrateEphemerals()? When the migration fix lands, you could retain the migration change behind a config option or remove it completely. Alternatively, if the fix is landing very soon, we can skip our PR.
We will be taking a look. For R1s, it's important that they move off the server, since we do not know if it is coming back. @matthiashanel may have some more updates.
Hi @matthiashanel, any progress on this? We're open to collaborating on code/tests if that helps speed this up. We'd love to have this.
If the consumer is an R1 but has a durable name, it will not be migrated as of release 2.9, which is out now.
So you should be good.
https://github.com/nats-io/nats-server/commit/c19d3907d3ddb812878525132583366ef8782544
Feel free to re-open if needed.
Hello @derekcollison @matthiashanel, is there any plan to implement migration for R1 durable consumers too? We would be happy to work on a fix and submit a PR for it; please let us know if there is any technical limitation we don't foresee.
Feel free to post a PR. You can also move assets now, so if you know you are going to take a server offline for an extended period, you can move R1 assets first.