go-workflows
Include max workflow age in stats, for alerting when workflows get "stuck"
We have monitoring configured so that our engineer-on-call is paged if our workflows fail to execute, either due to an acute failure -- a single workflow gets "stuck" -- or due to a performance degradation -- workflows are executing, but slowly, and are starting to back up in the queue.
Currently, this monitoring is based on the total number of pending workflows from GetStats. But this is an imperfect monitor: it can alert when the system is completely healthy, and it can miss the case where a single workflow is unhealthy.
A healthy system executing many workflows may always have a high number of in-flight workflows at any given time, even though each individual workflow is executing correctly. Conversely, the total number of pending workflows may be as low as 1, yet if that one workflow is stuck and failing to complete, we should be notified.
I think a better statistic might be "max workflow age," computed per workflow type. For example, if we have one type of workflow that should always finish within 30 minutes, and one instance is 40 minutes old, we want to be notified. For another type of workflow that normally finishes within 10 seconds, we want to be notified if any instance is 1 minute old or older.
I'm not sure of the best way to implement this, but I think adding a new Max Age statistic would help. GetStats could return the max workflow age per queue, and we could intentionally put our 30-minute workflows in a different queue than our 10-second workflows.