aurora icon indicating copy to clipboard operation
aurora copied to clipboard

Count number of times partitioned tasks reenter the cluster as healthy

Open DavidMcLaughlin opened this issue 7 years ago • 0 comments

Currently when a task is PARTITIONED and LOST, Aurora reschedules a replacement. Later on, the task can send a message saying it was healthy and then Aurora will kill the old task. Receiving this signal is a huge indicator that you could avoid unnecessary churn in the cluster by extending timeouts.

Add a metric to monitor how often this use case happens.

DavidMcLaughlin avatar Aug 14 '18 23:08 DavidMcLaughlin