
With active-active + cluster mode enabled, messages often get processed twice

Open Dbasdeo opened this issue 2 years ago • 7 comments

We had originally implemented a scheduler using your library with Redis ElastiCache in AWS with cluster mode disabled. Everything seemed to work out great. For context, our production service that implements the scheduler runs 3 instances. The scheduled task was always dispatched to a single instance, as expected.

We have now tried to switch to an enterprise version of Redis with cluster mode enabled and an active-active database. By that I mean that we are running our service in two regions (3 instances per region), each pointing to a different Redis database (i.e. Redis in region A and Redis in region B), with replication between the two.

What are the application dependencies?

  • Rqueue Version: 2.8.0-RELEASE
  • Spring Boot Version: 2.4.4
  • Spring Data Redis Version: 2.4.6

How to Reproduce (optional)?

  • This happens maybe 1 in every 10 times we schedule a job.

We aren't running under any load, as we currently only run smoke tests. We schedule a job with a wait of x seconds. x seconds later the job is consumed and executed by two instances. The instances are always in different regions. I'm guessing the library was just never built to handle this case?

Dbasdeo avatar Mar 12 '22 02:03 Dbasdeo

Yeah, it's not built to handle such scenarios. One thing you can do here is turn off the Rqueue workers/listeners in the second region.

Do you know what the Redis replication lag is between these two setups? Generally, if data is replicated immediately, the regions should not process duplicate messages; each should process unique messages.

Also, a more detailed description of the architecture would help me understand what's happening.

sonus21 avatar Mar 14 '22 11:03 sonus21

There seems to be a bit of lag in replication. When the same job gets dispatched to 2 instances in separate regions, both are actually able to update the request (we use Redis for persistence as well) without a CAS exception, maybe even 30 ms apart. I'm currently trying to figure out what the expected lag is.

As for our architecture, we are basically running a service in 2 European regions, each pointing to its own Redis DB, with replication between the two. We are constantly scheduling jobs in both regions that need to be executed in X minutes, and we have found that maybe 10% of the time they are processed twice (once in each region). The ideal scenario is that a job could be processed in either region, but only once. I assume there must be some sort of race condition in this active-active setup, one that didn't exist in a single region, that we are exposing.

I think we will just try to handle it ourselves, because our holy grail is to be able to process in multiple regions so that we are pretty fault tolerant. Just wanted to know if this had ever come up. We have been considering whether we could fork your code and try to solve it ourselves, but I imagine that would be harder for us than implementing it in our own service :/.
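The suspected race can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not reproduce Rqueue internals or the conflict-resolution rules of any real active-active Redis product. `Region`, `claim`, and `replicateTo` are hypothetical names. The assumption is that each region removes a due job from its local copy first, and that the removal reaches the peer only after some lag.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model (hypothetical, not Rqueue code): each region holds a local copy
// of the scheduled jobs; removals reach the peer only when replicateTo() runs.
class ActiveActiveRaceDemo {
    static final class Region {
        final Set<String> scheduled = new HashSet<>();
        // Local removals that have not yet been replicated to the peer.
        final Queue<String> pendingRemovals = new ArrayDeque<>();

        // Claims the job against the local copy only; returns true if this
        // region still saw the job as available.
        boolean claim(String jobId) {
            if (scheduled.remove(jobId)) {
                pendingRemovals.add(jobId);
                return true;
            }
            return false;
        }

        // Applies all pending local removals to the peer's copy.
        void replicateTo(Region peer) {
            String jobId;
            while ((jobId = pendingRemovals.poll()) != null) {
                peer.scheduled.remove(jobId);
            }
        }
    }

    // Returns how many regions processed the single scheduled job.
    static int run(boolean replicateBeforeSecondClaim) {
        Region a = new Region();
        Region b = new Region();
        a.scheduled.add("job-1");
        b.scheduled.add("job-1"); // the schedule itself has already replicated

        int processed = 0;
        if (a.claim("job-1")) processed++;
        if (replicateBeforeSecondClaim) a.replicateTo(b);
        if (b.claim("job-1")) processed++;
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("claim inside lag window: " + run(false)); // 2
        System.out.println("claim after replication: " + run(true));  // 1
    }
}
```

In this model the job is processed twice whenever the second region polls before the first region's removal has replicated, which would be consistent with duplicates that always involve instances in different regions.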

Dbasdeo avatar Mar 14 '22 17:03 Dbasdeo

@sonus21 I've started to think about the problem in a different way, but I'm not sure the library allows for it.

  1. Each region creates its own queue, schedules its own messages and handles its own results.
  2. Periodically, each region runs a health check on the opposite region. If the check returns 5xx, we start consuming from the other region's queue. Once we see the health check passing again, we stop consuming.

But I can't really see whether there is an exposed way for an instance to start / stop consuming from a queue, e.g. not using the annotation and polling manually or something. Does this exist?
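The two steps above could be wired together roughly as follows. This is a minimal sketch, not an Rqueue API: `PeerQueueConsumer` is a hypothetical hook standing in for whatever mechanism actually starts and stops the listener on the peer region's queue, and the health check itself (HTTP call, schedule, timeouts) is left out.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the health-check-driven switch. All names here are hypothetical.
class PeerFailoverSwitch {
    /** Hypothetical hook; not an Rqueue type. */
    interface PeerQueueConsumer {
        void start();
        void stop();
    }

    private final PeerQueueConsumer consumer;
    private final AtomicBoolean consuming = new AtomicBoolean(false);

    PeerFailoverSwitch(PeerQueueConsumer consumer) {
        this.consumer = consumer;
    }

    /** Feed in the HTTP status of the latest health check against the peer region. */
    void onHealthCheck(int httpStatus) {
        boolean peerDown = httpStatus >= 500 && httpStatus <= 599;
        if (peerDown && consuming.compareAndSet(false, true)) {
            consumer.start(); // peer looks unhealthy: take over its queue
        } else if (!peerDown && consuming.compareAndSet(true, false)) {
            consumer.stop();  // peer recovered: hand its queue back
        }
    }

    boolean isConsuming() {
        return consuming.get();
    }
}
```

The `compareAndSet` calls make repeated identical results idempotent, so `start()` and `stop()` fire only on a state change. A single 5xx flips the switch here; requiring several consecutive failures would be a reasonable refinement. Both regions may still consume the same queue briefly around a transition, so handlers would need to tolerate duplicates.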

Dbasdeo avatar Mar 15 '22 21:03 Dbasdeo

We can stop/start a listener at runtime, but adding a new queue at runtime is not supported. Why do you need an active-active setup for consumers? Shouldn't you stop the consumers in the other region?

sonus21 avatar Mar 16 '22 06:03 sonus21

Basically we want to be resilient to losing an entire region. If all the consumers in region A are lost (either the instances go down or Redis goes down), we would like region B to start consuming A's job queue so the jobs are not lost. We don't need to add a queue, just start a consumer of an existing queue at runtime. But it seems like all the consumer code is hidden in the library.

Dbasdeo avatar Mar 16 '22 17:03 Dbasdeo

I don't think it's correct to use a multi-region active-active setup for consumers. Even SQS is single-region; it is not replicated across regions.

https://stackoverflow.com/questions/66249605/does-aws-sqs-replicate-messages-across-regions

Can you refer me to some articles that suggest this solution for high availability?

Proposed solution:

  • Deploy the listener in all regions, but enable it in only one region using the task count.
  • Once you identify that the current region has an issue, decrease the task count of all listeners there to zero and increase the task count in another region. This can be automated, but it means the application might lose some messages, which should be acceptable given a region-level issue.

Expected delay in message processing (5-15 minutes)
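The switch-over in this proposal amounts to computing a new task count per region. The sketch below assumes "task count" means the number of running listener instances per region (for example, a desired task count in a container scheduler); the thread does not say which platform is meant, and `ListenerTaskPlan` and the region names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: computes listener task counts for an active-passive
// layout in which exactly one region runs listeners at a time.
class ListenerTaskPlan {
    static Map<String, Integer> plan(List<String> regions, String activeRegion, int activeTaskCount) {
        if (!regions.contains(activeRegion)) {
            throw new IllegalArgumentException("unknown region: " + activeRegion);
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String region : regions) {
            // Only the active region runs listeners; every other region is scaled to zero.
            counts.put(region, region.equals(activeRegion) ? activeTaskCount : 0);
        }
        return counts;
    }
}
```

Failing over from region A to region B would then mean applying `plan(regions, "region-b", 3)` instead of `plan(regions, "region-a", 3)` through whatever deployment tooling is in use.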

sonus21 avatar Mar 19 '22 08:03 sonus21

I think your proposed solution sounds like what we are going to attempt. It is the direction I was starting to angle toward. I was trying to find something like task count, so thank you for sharing!

I honestly was just dreaming a bit that this would be possible, but it sounds like my head was a bit in the clouds.

Dbasdeo avatar Mar 22 '22 12:03 Dbasdeo