cloudfriend

scheduled lambdas shouldn't alarm while retries remain

Ramshackle-Jamathon opened this issue 5 years ago • 5 comments

Proposal: an alarm should only go off if a scheduled lambda is in a terminal error state — that is, its retries are exhausted and there is no chance that the execution will succeed without manual intervention.

This PR increases the alarm threshold to 3 failure datapoints (periods). According to the AWS docs, this is the maximum number of retries that will happen for a CloudWatch-triggered lambda: https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html.
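For illustration only (this is not the actual PR diff, and the function name is hypothetical), the change described above amounts to a CloudWatch alarm on the lambda's `Errors` metric that requires 3 consecutive failing datapoints before firing — enough to cover the initial invocation plus the automatic async retries:

```javascript
// Sketch of a CloudFormation alarm resource, written as a JavaScript object
// in the style cloudfriend templates use. Names and values are illustrative.
const lambdaErrorAlarm = {
  Type: 'AWS::CloudWatch::Alarm',
  Properties: {
    AlarmDescription: 'Scheduled lambda is in a terminal error state',
    Namespace: 'AWS/Lambda',
    MetricName: 'Errors',
    Dimensions: [{ Name: 'FunctionName', Value: 'my-scheduled-function' }],
    Statistic: 'Sum',
    Period: 60,
    EvaluationPeriods: 3, // raised from 1 to 3 failure datapoints (periods)
    Threshold: 0,
    ComparisonOperator: 'GreaterThanThreshold',
    TreatMissingData: 'missing'
  }
};
```

With `EvaluationPeriods: 3`, a single transient failure that succeeds on retry never trips the alarm; only three failing periods in a row do.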

cc @mapbox/platform

Ramshackle-Jamathon avatar Nov 25 '19 20:11 Ramshackle-Jamathon

Hmm. What about the interaction between lambda invocation time and alarm periods? It seems this could open the door to a scheduled lambda function misconfigured in such a way that it's always failing, but never failing fast enough to build up the 3+ errors it'd take to trip the alarm.

rclark avatar Nov 25 '19 20:11 rclark

@rclark setting treatMissingData to missing will handle failures that happen late in executions.

Out of scope for this PR, but it might be worth revisiting setting it to notBreaching in any of the lambda shortcuts, since its benefit (shorter alarm windows?) is likely outweighed by the fact that it makes it easier to not alarm on valid alarm states for M-of-N alarms.
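To make the treatMissingData point concrete, here's a toy simulation (an assumption about the behavior, not CloudWatch internals): a failing lambda whose runs span roughly two periods leaves every other datapoint missing. With notBreaching those gaps count as OK datapoints and dilute the failures out of a 3-of-3 window; with missing the gaps are skipped and the failures line up:

```javascript
// Toy model of M-of-N alarm evaluation. `series` is error counts per period,
// with null meaning no datapoint was published for that period.
function evaluate(series, treatMissingData, m, n, threshold) {
  const filled = series
    // notBreaching treats a missing period as a non-breaching (0) datapoint
    .map(d => (d === null && treatMissingData === 'notBreaching' ? 0 : d))
    // 'missing' simply ignores periods with no datapoint
    .filter(d => d !== null);
  const window = filled.slice(-n); // the last N datapoints considered
  const breaching = window.filter(d => d > threshold).length;
  return breaching >= m ? 'ALARM' : 'OK';
}

// A lambda that fails on every run, but each run takes ~2 periods:
const errorsPerMinute = [1, null, 1, null, 1, null];

const withNotBreaching = evaluate(errorsPerMinute, 'notBreaching', 3, 3, 0);
const withMissing = evaluate(errorsPerMinute, 'missing', 3, 3, 0);
console.log(withNotBreaching); // 'OK' -- the gaps mask a persistent failure
console.log(withMissing);      // 'ALARM' -- only real datapoints are counted
```

This is the "easier to not alarm on valid alarm states" failure mode in miniature.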

Ramshackle-Jamathon avatar Nov 25 '19 21:11 Ramshackle-Jamathon

😢 I can't reproduce the CI failure locally

Ramshackle-Jamathon avatar Nov 25 '19 21:11 Ramshackle-Jamathon

Also wanna call out that there are quite a few gotchas that stem from async lambdas, especially if they run on very short schedules, are triggered by frequent events, or are long running.

Unlike other systems that isolate each scheduled execution, the failures of one async lambda run might overlap with a subsequent successful run. This makes writing a CloudWatch alarm definition that is more nuanced than "alarm on a single failure" very hard (actually I'm pretty confident it's impossible).

I think the solution for this is to have a DLQ configured for every async lambda, and have the CloudWatch alarm look at that queue's depth instead of lambda execution failures.
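A rough sketch of that DLQ-based approach, again as plain CloudFormation-shaped JavaScript (an assumption about how it could be wired up, not a cloudfriend shortcut): exhausted async invocations land on an SQS dead letter queue, and the alarm watches the queue depth, which is immune to overlapping runs:

```javascript
// Hypothetical resources: a DLQ, a function that dead-letters into it,
// and an alarm on the queue depth. Code/handler/role details are elided.
const resources = {
  DeadLetterQueue: {
    Type: 'AWS::SQS::Queue',
    Properties: { MessageRetentionPeriod: 1209600 } // keep failed events 14 days
  },
  ScheduledFunction: {
    Type: 'AWS::Lambda::Function',
    Properties: {
      // ...Code, Handler, Role, etc. elided...
      DeadLetterConfig: {
        TargetArn: { 'Fn::GetAtt': ['DeadLetterQueue', 'Arn'] }
      }
    }
  },
  DeadLetterAlarm: {
    Type: 'AWS::CloudWatch::Alarm',
    Properties: {
      AlarmDescription: 'Async lambda exhausted its retries',
      Namespace: 'AWS/SQS',
      MetricName: 'ApproximateNumberOfMessagesVisible',
      Dimensions: [
        { Name: 'QueueName', Value: { 'Fn::GetAtt': ['DeadLetterQueue', 'QueueName'] } }
      ],
      Statistic: 'Maximum',
      Period: 60,
      EvaluationPeriods: 1,
      Threshold: 0,
      ComparisonOperator: 'GreaterThanThreshold'
    }
  }
};
```

Any message on the DLQ is by definition a terminal failure (retries exhausted), so a single breaching datapoint is enough here.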

Ramshackle-Jamathon avatar Nov 25 '19 21:11 Ramshackle-Jamathon

Also wanna call out there are quite a few gotchas that stem from async lambdas ... makes writing a cloudwatch alarm def that is more nuanced than "alarm on a single failure" very hard

That sentiment makes me wonder whether we just shouldn't make the shortcut alarm definition try to be any more nuanced. If in fact you have to think about alarming in the context of the function you're writing, then, well, you have to think about it instead of expecting the shortcut to do it for you.

rclark avatar Nov 25 '19 23:11 rclark