cloudfriend
scheduled lambdas shouldn't alarm while retries remain
proposal: An alarm should only go off if a scheduled lambda is in a terminal error state, that is, its retries are exhausted and there is no chance that the execution will succeed without manual intervention.
This PR increases the alarm threshold to 3 failure datapoints (periods). According to the AWS docs, this is the maximum number of retries for a CloudWatch-triggered lambda: https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html.
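As a sketch, the alarm this change implies looks something like the following raw CloudFormation fragment expressed as a plain JS object. The function name, period, and exact property values here are illustrative placeholders, not the shortcut's actual output:

```javascript
// Sketch of a CloudWatch alarm that only fires after 3 failure periods,
// i.e. after the async retries discussed above are exhausted.
// 'my-scheduled-function' and Period are hypothetical placeholders.
const alarm = {
  Type: 'AWS::CloudWatch::Alarm',
  Properties: {
    Namespace: 'AWS/Lambda',
    MetricName: 'Errors',
    Dimensions: [{ Name: 'FunctionName', Value: 'my-scheduled-function' }],
    Statistic: 'Sum',
    Period: 60,
    // 3 evaluation periods with errors = the invocation plus its retries
    // have all failed before anyone gets paged.
    EvaluationPeriods: 3,
    Threshold: 0,
    ComparisonOperator: 'GreaterThanThreshold'
  }
};
```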
cc @mapbox/platform
Hmm. What about the interaction between lambda invocation time & alarm periods? It seems that this could open the door to a scheduled lambda function misconfigured in such a way that it was always failing, but never failing fast enough to build up to the 3+ errors it'd take to trip the alarm?
@rclark setting treatMissingData to missing will handle failures that happen late in executions.
Out of scope for this PR, but it might be worth revisiting setting it to notBreaching in any of the lambda shortcuts, since its benefit (shorter alarm windows?) is likely outweighed by the fact that it makes it easier to fail to alarm on valid alarm states for M-of-N alarms.
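For clarity, a minimal sketch of the two TreatMissingData values being weighed here (the property name and values are AWS's; which one the shortcuts should use is the open question):

```javascript
// The two TreatMissingData settings discussed above, for an M-of-N alarm
// on a scheduled lambda that may have long gaps between datapoints.
const missing = { TreatMissingData: 'missing' };
// Gaps carry no signal: periods without data don't count toward M of N,
// so a failure that lands late in a long execution still registers.

const notBreaching = { TreatMissingData: 'notBreaching' };
// Gaps count as OK: empty periods are treated as healthy, which can
// dilute real failures and keep an M-of-N alarm from ever tripping.
```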
😢 I can't reproduce the CI failure locally
Also wanna call out that there are quite a few gotchas that stem from async lambdas, especially if they run on very short schedules, are triggered by frequent events, or are long running.
Unlike other systems that isolate each scheduled execution, the failures of one async lambda run might overlap with a subsequent successful run. This makes writing a cloudwatch alarm definition that is more nuanced than "alarm on a single failure" very hard (actually I'm pretty confident it's impossible).
I think the solution for this is to have a DLQ configured for every async lambda and have the cloudwatch alarm look at that queue's depth instead of lambda execution failures.
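As a hedged sketch of that alternative, a DLQ-depth alarm could look like the fragment below. The queue name is a hypothetical placeholder; ApproximateNumberOfMessagesVisible is the standard SQS CloudWatch metric:

```javascript
// Sketch: alarm on DLQ depth rather than on Lambda Errors. A message only
// reaches the DLQ after all async retries are exhausted, so any visible
// message is a terminal failure and a single-period alarm is enough.
// 'my-function-dlq' is a placeholder queue name.
const dlqAlarm = {
  Type: 'AWS::CloudWatch::Alarm',
  Properties: {
    Namespace: 'AWS/SQS',
    MetricName: 'ApproximateNumberOfMessagesVisible',
    Dimensions: [{ Name: 'QueueName', Value: 'my-function-dlq' }],
    Statistic: 'Maximum',
    Period: 60,
    EvaluationPeriods: 1,
    Threshold: 0,
    ComparisonOperator: 'GreaterThanThreshold'
  }
};
```

This sidesteps the overlap problem entirely: queue depth measures terminal failures directly, regardless of how executions interleave.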
Also wanna call out there are quite a few gotchas that stem from async lambdas ... makes writing a cloudwatch alarm def that is more nuanced than "alarm on a single failure" very hard
That sentiment makes me wonder whether we just shouldn't make the shortcut alarm definition try to be any more nuanced. If in fact you have to think about it in the context of the function you're writing, then, well, you have to think about it, instead of expecting the shortcut to do it for you.