
fix(queue): Reset ackAttempts on successful SQL ack

Open · nicolasff opened this issue 1 year ago · 10 comments

Reset ackAttempts from the message metadata when a message is ack'd. This avoids having this number grow over time for very long-lived messages that keep getting re-enqueued, with the occasional failure causing a message to not be acknowledged and eventually get dropped once 5 acks have been missed in total.


More detailed explanation

A long-running Orca Task in a Stage will cause the same RunTask message to be de-queued and re-enqueued over and over with all its attributes as it returns RUNNING to indicate it has not yet finished executing. If any part of the queue management code fails because Orca is too busy or even crashes right after de-queuing a message, the message will have its “un-acknowledged” counter incremented by one (ackAttempts++). For long-running stages, it is possible for a task to eventually reach ackAttempts = 5 (Queue.maxRetries), which will cause SqlQueue in Orca to abandon the message and effectively stop processing its branch of the pipeline. When a long-running pipeline execution is still marked as RUNNING but no longer has any messages on the queue, it becomes a “zombie” that can’t be canceled by regular users.

The proposed fix is to reset ackAttempts to zero when a message is processed successfully, as would happen repeatedly with a long-lived stage. Instead of dropping messages when they reach 5 missed acknowledgments in total, we’ll now drop them only if they miss 5 in a row – which gives us a clear indication that the message just cannot be processed at this time.
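Concretely, here is a minimal Kotlin sketch of the idea. This is not the actual keiko/Orca SqlQueue code; only the names ackAttempts, maxRetries (Queue.maxRetries) and resetAttemptsOnAck come from this change, everything else is a simplified stand-in:

```kotlin
// Simplified stand-in for the queue's per-message bookkeeping; the real SqlQueue
// keeps this state in SQL, but only the counter logic matters for this sketch.
data class QueuedMessage(val id: String, var ackAttempts: Int = 0)

class AckTracker(
    private val maxRetries: Int = 5,               // Queue.maxRetries
    private val resetAttemptsOnAck: Boolean = true // the new behavior proposed here
) {
    private val deadLetters = mutableListOf<QueuedMessage>()

    // Called by the redelivery check when a delivered message was never ack'd.
    fun onMissedAck(message: QueuedMessage) {
        message.ackAttempts++
        if (message.ackAttempts >= maxRetries) {
            deadLetters += message // abandoned: its pipeline branch becomes a "zombie"
        }
    }

    // Called when the message handler completes and acks the message.
    fun onAck(message: QueuedMessage) {
        if (resetAttemptsOnAck) {
            // Only consecutive missed acks count toward maxRetries.
            message.ackAttempts = 0
        }
    }
}
```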

Consider the analogy of a ping sequence used to monitor a host's uptime: if we leave ping running for 2 weeks monitoring a remote host, do we mark the host as down once it has missed 5 pings in total over these 2 weeks, or when it has missed 5 in a row?

nicolasff avatar Jun 20 '23 21:06 nicolasff

Related to, and the cause of, https://github.com/spinnaker/spinnaker/issues/6597

xibz avatar Jun 20 '23 21:06 xibz

This looks like a sql-specific change. Mind updating the commit message to indicate that? Or perhaps implement it for redis as well :)

dbyron-sf avatar Jun 20 '23 23:06 dbyron-sf

I've updated the commit message to mention that this only affects the SQL implementation.

I'm not really sure what can be done with Redis; the docs are correct that significant data loss occurs when the Redis backend goes away. If it's even fixable, it will be a much more complex change than what is being proposed here.

nicolasff avatar Jun 21 '23 02:06 nicolasff

Thanks @nicolasff!

dbyron-sf avatar Jun 21 '23 19:06 dbyron-sf

What's involved in writing an automated test to exercise this functionality?

dbyron-sf avatar Jun 21 '23 19:06 dbyron-sf

I've been wondering how to write a test for this. The way I went about validating the change involved manual fault injection within Orca in specific places, followed by breakpoints in other places to observe the ackAttempts counter being reset, all while keeping tabs on the SQL data in a terminal. Definitely not a repeatable process.

To trigger this condition and cover the new if (resetAttemptsOnAck) branch, we could use (a rough sketch follows the list):

  1. A SqlQueue instance
  2. A QueueProcessor using a thread pool containing a single thread
  3. An Orca "task" class (com.netflix.spinnaker.orca.api.pipeline.Task), maybe one that calls Thread.sleep() to block that single thread
  4. A thread producing messages and enqueueing them, only for the queue processor to fail to submit them to its thread pool, accumulating missed acks.
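For what it's worth, a standalone sketch of that setup. It does not use the real SqlQueue or QueueProcessor: a single-thread pool with a one-slot work queue stands in for point 2, a sleeping handler for point 3, and rejected submissions are counted as missed acks, which is a simplification of what the queue processor actually does:

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.RejectedExecutionException
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

fun main() {
    val resetAttemptsOnAck = true
    val maxRetries = 5                  // Queue.maxRetries
    val ackAttempts = AtomicInteger(0)  // the counter carried in the message metadata

    // Point 2: a processor with a single thread and a one-slot work queue, so it
    // rejects new submissions as soon as the handler falls behind.
    val pool = ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, ArrayBlockingQueue<Runnable>(1))

    // Point 4: a producer repeatedly handing the same message to the processor.
    repeat(10) {
        try {
            pool.execute {
                Thread.sleep(100)                          // Point 3: a handler that blocks its only thread
                if (resetAttemptsOnAck) ackAttempts.set(0) // successful ack resets the counter
            }
        } catch (e: RejectedExecutionException) {
            ackAttempts.incrementAndGet()                  // submission failed: count a missed ack
        }
        Thread.sleep(20)
    }

    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
    println("ackAttempts=${ackAttempts.get()}, dropped=${ackAttempts.get() >= maxRetries}")
}
```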

I'm going to look into implementing this today.

nicolasff avatar Jun 22 '23 18:06 nicolasff

Latest push: added a Spek test for both values of resetAttemptsOnAck, showing that the ackAttempts attribute is reset on ack when this option is enabled. Also rebased on master (currently b23c78b02c014bfa9bf2f2462d40f889bf97fb37).

edit: attached a screenshot of the test run.

nicolasff avatar Jul 07 '23 00:07 nicolasff

@dbyron-sf thanks for all the feedback, I'll make the changes you suggested and will look into this discrepancy with the HTML test report. I was also thinking of adding some extra steps to the test, validating that the message makes it to the DLQ or not depending on the flag after, say, 4 processing failures + 1 success + 1 re-enqueue + 1 more failure.
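Spelled out for that sequence (just the counter arithmetic, assuming Queue.maxRetries = 5 as above, not the real queue):

```kotlin
fun main() {
    for (resetAttemptsOnAck in listOf(true, false)) {
        var ackAttempts = 0
        repeat(4) { ackAttempts++ }              // 4 processing failures (missed acks)
        if (resetAttemptsOnAck) ackAttempts = 0  // 1 success: the ack resets the counter only if the flag is on
        ackAttempts++                            // 1 re-enqueue followed by 1 more failure
        val dlq = ackAttempts >= 5               // Queue.maxRetries
        println("resetAttemptsOnAck=$resetAttemptsOnAck -> ackAttempts=$ackAttempts, DLQ=$dlq")
    }
}
```

With the flag on, the message ends the sequence at ackAttempts = 1 and stays on the queue; with it off, it reaches 5 and would be dropped to the DLQ.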

nicolasff avatar Jul 12 '23 19:07 nicolasff

Sounds great. The more coverage the better :)

dbyron-sf avatar Jul 12 '23 19:07 dbyron-sf

@dbyron-sf Any thoughts about this? Looks like the requested changes didn't happen but it's potentially a useful fix?

mattgogerly avatar Apr 15 '24 18:04 mattgogerly