fix(queue): Reset ackAttempts on successful SQL ack
Reset `ackAttempts` in the message metadata when a message is ack'd.

This prevents the counter from growing over time for very long-lived messages that keep getting re-enqueued, where occasional failures leave a message unacknowledged until it is eventually dropped once 5 acks have been missed in total.
More detailed explanation
A long-running Orca Task in a Stage will cause the same `RunTask` message to be de-queued and re-enqueued over and over with all its attributes, as the task keeps returning `RUNNING` to indicate it has not yet finished executing. If any part of the queue management code fails because Orca is too busy, or Orca crashes right after de-queuing a message, the message has its “un-acknowledged” counter incremented by one (`ackAttempts++`). For long-running stages, it is therefore possible for a message to eventually reach `ackAttempts = 5` (`Queue.maxRetries`), at which point `SqlQueue` abandons the message and effectively stops processing that branch of the pipeline. When a long-running pipeline execution is still marked as `RUNNING` but no longer has any messages on the queue, it becomes a “zombie” that can’t be canceled by regular users.
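To make the failure mode concrete, the redelivery path described above boils down to something like the sketch below. This is illustrative only, not the actual `SqlQueue` code; `QueuedMessage`, `RedeliverySketch`, `redeliver` and `sendToDlq` are placeholder names.

```kotlin
// Illustrative only: a simplified model of the redelivery/drop logic described above.
data class QueuedMessage(val id: String, var ackAttempts: Int = 0)

class RedeliverySketch(private val maxRetries: Int = 5) { // cf. Queue.maxRetries
    fun redeliver(
        message: QueuedMessage,
        requeue: (QueuedMessage) -> Unit,
        sendToDlq: (QueuedMessage) -> Unit
    ) {
        message.ackAttempts++ // one more missed acknowledgment
        if (message.ackAttempts >= maxRetries) {
            // Abandoned: its pipeline branch stops being processed and the execution becomes a "zombie".
            sendToDlq(message)
        } else {
            requeue(message) // put it back on the queue to be retried
        }
    }
}
```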
The proposed fix is to reset `ackAttempts` to zero when a message is processed successfully, as happens repeatedly with a long-lived stage. Instead of dropping messages once they reach 5 missed acknowledgments in total, we now drop them only when they miss 5 in a row, which is a much clearer signal that the message simply cannot be processed at this time.
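In code terms, the change amounts to something like the following, reusing the `QueuedMessage` placeholder from the sketch above. Only the flag name `resetAttemptsOnAck` comes from the actual change; the rest is illustrative.

```kotlin
// Illustrative only: the ack path with the proposed behaviour behind a flag.
class AckSketch(private val resetAttemptsOnAck: Boolean) {
    fun ack(message: QueuedMessage) {
        if (resetAttemptsOnAck) {
            // A successful ack means the message was handled this time around, so
            // only *consecutive* missed acks should count toward the drop limit.
            message.ackAttempts = 0
        }
        // ...the message is then removed from the unacked/in-flight set as before.
    }
}
```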
Consider the analogy of a `ping` sequence used to monitor a host's uptime: if we leave `ping` running for 2 weeks monitoring a remote host, do we mark the host as down once it has missed 5 pings in total over these 2 weeks, or when it has missed 5 in a row?
Related to, and the cause of, https://github.com/spinnaker/spinnaker/issues/6597
This looks like a sql-specific change. Mind updating the commit message to indicate that? Or perhaps implement it for redis as well :)
I've updated the commit message to mention that this only affects the SQL implementation.
I'm not really sure what can be done with Redis, the docs are correct to say that a significant data loss occurs when the Redis backend goes away. If it's even fixable it will be a much more complex change than what is being proposed here.
Thanks @nicolasff!
What's involved in writing an automated test to exercise this functionality?
I've been wondering how to write a test for this. The way I went about validating the change involved manual fault injection within Orca in specific places, followed by breakpoints in other places to observe the `ackAttempts` counter being reset, all while keeping tabs on the SQL data in a terminal. Definitely not a repeatable process.
To trigger this condition and cover the `if (resetAttemptsOnAck)` branch, we could use (see also the sketch after this list):

- A `SqlQueue` instance
- A `QueueProcessor` using a thread pool containing a single thread
- An Orca "task" class (`com.netflix.spinnaker.orca.api.pipeline.Task`), maybe one that calls `Thread.sleep()` to block that single thread
- A thread producing messages and enqueueing them, only for the queue processor to fail to submit them to its thread pool, accumulating missed acks.
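For what it's worth, here is a minimal, self-contained sketch of the counting behaviour such a test needs to pin down. It is not the integration test outlined above (that would need a real `SqlQueue` and `QueueProcessor` wired to a test database); `FakeAckTrackingQueue` and its methods are made up for illustration.

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Assertions.assertTrue
import org.junit.jupiter.api.Test

// Made-up stand-in for the queue, tracking missed acks per message id.
class FakeAckTrackingQueue(
    private val resetAttemptsOnAck: Boolean,
    private val maxRetries: Int = 5
) {
    val ackAttempts = mutableMapOf<String, Int>()
    val deadLetters = mutableListOf<String>()

    // The message was de-queued but never acknowledged in time.
    fun missAck(id: String) {
        val attempts = (ackAttempts[id] ?: 0) + 1
        ackAttempts[id] = attempts
        if (attempts >= maxRetries) deadLetters += id
    }

    // The message was processed and acknowledged successfully.
    fun ack(id: String) {
        if (resetAttemptsOnAck) ackAttempts[id] = 0
    }
}

class ResetAttemptsOnAckSketch {
    @Test
    fun `a successful ack resets the counter when the flag is enabled`() {
        val queue = FakeAckTrackingQueue(resetAttemptsOnAck = true)
        repeat(4) { queue.missAck("msg-1") } // four missed acks in a row
        queue.ack("msg-1")                   // then one successful round
        assertEquals(0, queue.ackAttempts.getValue("msg-1"))
        assertTrue(queue.deadLetters.isEmpty())
    }
}
```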
I'm going to look into implementing this today.
Latest push: added a Spek test for both values of `resetAttemptsOnAck`, showing that the `ackAttempts` attribute is reset on ack when this option is enabled. Also rebased on `master` (currently b23c78b02c014bfa9bf2f2462d40f889bf97fb37).
edit: test run output attached.
@dbyron-sf thanks for all the feedback, I'll make the changes you suggested and will look into this discrepancy with the HTML test report. I was also thinking of adding some extra steps to the test, validating that the message does or does not make it to the DLQ, depending on the flag, after say 4 processing failures + 1 success + 1 re-enqueue + 1 more failure.
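That extra scenario could look roughly like this, reusing the hypothetical `FakeAckTrackingQueue` and JUnit imports from the earlier sketch (again illustrative only, not the actual Spek test):

```kotlin
class DlqScenarioSketch {
    @Test
    fun `with the flag, 4 failures + 1 success + 1 more failure stays off the DLQ`() {
        val queue = FakeAckTrackingQueue(resetAttemptsOnAck = true)
        repeat(4) { queue.missAck("msg-1") } // 4 processing failures
        queue.ack("msg-1")                   // 1 success, message re-enqueued
        queue.missAck("msg-1")               // 1 more failure afterwards
        assertEquals(1, queue.ackAttempts.getValue("msg-1"))
        assertTrue(queue.deadLetters.isEmpty())
    }

    @Test
    fun `without the flag, the same sequence reaches 5 missed acks and is dropped`() {
        val queue = FakeAckTrackingQueue(resetAttemptsOnAck = false)
        repeat(4) { queue.missAck("msg-1") }
        queue.ack("msg-1")                   // ack does not reset the counter
        queue.missAck("msg-1")               // 5th missed ack in total
        assertEquals(listOf("msg-1"), queue.deadLetters)
    }
}
```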
Sounds great. The more coverage the better :)
@dbyron-sf Any thoughts about this? It looks like the requested changes didn't happen, but it's potentially a useful fix?