
JetStream - R1 file consumer loses offset after server restart

Open goku321 opened this issue 2 years ago • 8 comments

Defect

Make sure that these boxes are checked before submitting your issue -- thank you!

  • [ ] Included nats-server -DV output
  • [x] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

Versions of nats-server and affected client libraries used:

nats-server: v2.8.4, NATS CLI: v0.0.33

OS/Container environment:

macOS Monterey

Steps or code to reproduce the issue:

  1. Create a 3-node JetStream setup.
  2. Create a stream and an R1 consumer.
  3. Publish some messages and start the consumer.
  4. Stop the leader node for the above R1 consumer (a CLI sketch of these steps follows below).
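
For reference, a rough sketch of these steps with the NATS CLI. The stream, consumer, and subject names (ORDERS, WORKER, orders.>) are made up, the cluster bootstrap is abbreviated, and exact flag names can vary between natscli versions (unset options are prompted for interactively):

```sh
# 1. Start three clustered nats-server nodes with JetStream enabled
#    (ports, store dirs, and cluster routes omitted for brevity).
nats-server -js --cluster_name C1 ...   # repeat per node with its own config

# 2. Create an R3 file-backed stream and a durable R1 pull consumer.
nats stream add ORDERS --subjects "orders.>" --storage file --replicas 3
nats consumer add ORDERS WORKER --pull --deliver all --ack explicit --replicas 1

# 3. Publish some messages and consume a few.
nats pub orders.new "hello" --count 100
nats consumer next ORDERS WORKER --count 10

# 4. Find the consumer's leader via `nats consumer info ORDERS WORKER`,
#    stop that nats-server process, then inspect `nats consumer report ORDERS`.
```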

Expected result:

The consumer should get re-assigned to a different node and should work just fine upon restarting (the consumer).

Actual result:

Consumer re-assignment sometimes fails, even with a graceful shutdown. The consumer data gets migrated to a different node, but nats consumer info shows a different (hashed) name for the consumer, and selecting that consumer returns "consumer not found".

(Screenshots attached: nats consumer info output, Jul 12 and Jul 13.)

Proposal (Added another point):

  1. Re-assignment should consider multiple possible peers instead of just one - here
  2. There should be retry logic for failed re-assignments - here. A different peer should be tried for the assignment if the first one fails.
  3. If nothing else, at least ensure that an R1 file consumer can resume from where it left off before the restart. Being off by one or two messages is acceptable, but losing the offset completely is not.

EDIT:

Restarting the leader node of an R1 consumer (after some downtime) doesn't necessarily recover the consumer: killing the node triggers an attempt to migrate the consumer, and if that migration fails, the consumer is lost.

goku321 avatar Jul 13 '22 13:07 goku321

@goku321 What's your expectation for an R2 consumer in an R5 stream where both consumer nodes are shut down?

In step 2 you write "Create a stream and an R1 consumer". That is a durable consumer? Also, mind sharing your stream/consumer config?

matthiashanel avatar Jul 14 '22 18:07 matthiashanel

Hi @matthiashanel, I work with @goku321.

@goku321 What's your expectation for an R2 consumer in an R5 stream where both consumer nodes are shut down?

We use an R1 consumer. We are looking for a DR strategy for when the node persisting the R1 consumer is lost forever. Latency is okay, but losing the offset is not. Every message results in a lot of compute, so re-processing lots of messages is an issue.

In step 2 you write "Create a stream and an R1 consumer". That is a durable consumer?

Yes.

Also mind sharing your stream/consumer config?

Stream is R3 File, consumer is R1 File.
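
For context, a minimal sketch of that configuration with the nats.go client. The stream/consumer/subject names are made up, and setting Replicas on the consumer assumes a client and server version that support per-consumer replicas:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// R3 stream backed by file storage: message data survives the loss of a node.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
		Storage:  nats.FileStorage,
		Replicas: 3,
	}); err != nil {
		log.Fatal(err)
	}

	// Durable R1 file consumer: its delivery state (the offset) lives on a single node.
	if _, err := js.AddConsumer("ORDERS", &nats.ConsumerConfig{
		Durable:   "WORKER",
		AckPolicy: nats.AckExplicitPolicy,
		Replicas:  1,
	}); err != nil {
		log.Fatal(err)
	}
}
```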

shiv4289 avatar Jul 15 '22 07:07 shiv4289

@shiv4289, but losing the offset is basically what it would mean if we were to automatically migrate the R1 consumer once its machine disappears.

Expected result: The consumer should get re-assigned to a different node and should work just fine upon restarting (the consumer).

Since you write:

Latency is okay, but losing the offset is not

wouldn't it be ok for you to make the durable consumer R3 then? With R=1 you have one copy of the offsets etc... with R=3 that'd be 3.
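
For illustration, that suggestion with the CLI would look something like the line below; the names are the hypothetical ones from earlier, and the --replicas flag for consumers depends on the natscli/server version:

```sh
# Durable consumer whose delivery state is replicated across three nodes.
nats consumer add ORDERS WORKER --pull --deliver all --ack explicit --replicas 3
```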

matthiashanel avatar Jul 15 '22 19:07 matthiashanel

@shiv4289, but losing the offset is basically what it would mean if we were to automatically migrate the R1 consumer once its machine disappears.

Expected result: The consumer should get re-assigned to a different node and should work just fine upon restarting (the consumer).

Since you write:

Latency is okay, but losing the offset is not

wouldn't it be ok for you to make the durable consumer R3 then? With R=1 you have one copy of the offsets etc... with R=3 that'd be 3.

Hi @matthiashanel, we started with R3 consumers, but we are designing for 1 million consumers and 1000 R3 streams, and 1 million R3 consumers do not work according to our JetStream benchmarks. In our use case, the message throughput is low, and we can tolerate latencies of up to a couple of minutes.

We tried a 7-node cluster of c5.4xlarge EC2 machines. Recovery takes what looks like forever whenever we restart a machine, rendering JetStream unavailable. We prefer durable consumers because the consuming apps stay on customers' premises. That left us with only R1. Happy to try more ideas if you have them :-)

shiv4289 avatar Jul 18 '22 03:07 shiv4289

@matthiashanel I've added an edit to the original post. We were under the impression that if we restart the leader node of an R1 consumer, the consumer will still be there and can resume from where it left off. But that was not the case when we tested this scenario: consumer migration can fail when the node is taken down, which makes an R1 consumer not very useful for us. For our use case, we would be happy if we could recover an R1 consumer when the leader node is back online, or if there were a way to make sure the migration succeeds consistently.

goku321 avatar Jul 18 '22 08:07 goku321

We are actively working on making this work correctly such that R1 assets, whether stream or consumer, can be successfully migrated.

derekcollison avatar Jul 18 '22 15:07 derekcollison

@matthiashanel @derekcollison Hi, we have commented out the consumer migration code that runs on server stop and tested it. We are using this modified server binary and it works in a predictable way:

  1. When the nats-server node persisting the R1 consumer is down, the consumer hangs.
  2. As soon as that server comes back, the consumer starts consuming again. Happy to send a PR that puts migrateEphemerals() behind a configuration option or removes it completely until your solution lands.

Would you like us to submit that PR here, removing migrateEphemerals()? When the migration fix lands, you could keep the migration change under a config option or remove it completely. Alternatively, if the fix is landing very soon, we can skip our PR.

shiv4289 avatar Jul 21 '22 11:07 shiv4289

We will be taking a look. For R1s it's important that they move off the server, since we do not know if it is coming back. @matthiashanel may have some more updates.

derekcollison avatar Jul 21 '22 13:07 derekcollison

Hi @matthiashanel, any progress on this? We're open to collaborating on code/tests if that helps speed this up. We'd love to have this.

shiv4289 avatar Sep 13 '22 15:09 shiv4289

If the consumer is an R1 but has a durable name, it will not be migrated as of release 2.9, which is out now.

So you should be good.
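
Illustratively, the 2.9 behavior described here amounts to a rule like the sketch below; this is not the actual nats-server code (see the commit linked in the next comment), just the decision it describes:

```go
package main

import "fmt"

// Sketch only: when a peer is lost, ephemeral R1 consumers are moved off
// that server, while durable R1 consumers are left in place so they can
// resume from their stored offset when the node returns.
func shouldMigrateR1(durableName string) bool {
	return durableName == "" // no durable name => ephemeral => migrate
}

func main() {
	fmt.Println(shouldMigrateR1("WORKER")) // false: durable R1 stays put
	fmt.Println(shouldMigrateR1(""))       // true: ephemeral R1 is migrated
}
```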

derekcollison avatar Sep 14 '22 02:09 derekcollison

https://github.com/nats-io/nats-server/commit/c19d3907d3ddb812878525132583366ef8782544

Feel free to re-open if needed.

derekcollison avatar Sep 14 '22 03:09 derekcollison

Hello @derekcollison @matthiashanel, is there any plan to implement migration for R1 durable consumers too? Would you like us to implement and submit a PR for this? We would be happy to work on it; otherwise, please let us know if there is a technical limitation we don't foresee.

sourabhaggrawal avatar Oct 01 '22 12:10 sourabhaggrawal

Feel free to post a PR. You can also move assets now, so if you know you are going to take a server offline for an extended period, you can move R1 assets first.

derekcollison avatar Oct 01 '22 14:10 derekcollison