
Function: setting log retention causes `rate exceeded` with a lot of Functions

Open fwang opened this issue 4 years ago • 8 comments

Problem

Log retention is configured via the AWS SDK through a custom resource. With a lot of Functions, the AWS SDK calls are made concurrently, resulting in a `rate exceeded` error.

Setting `logRetentionRetryOptions` to 100 retries fixes the `rate exceeded` error, but causes:

Received response status [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress against this resource. Please try again. (RequestId: 27daaaf0-f4b5-472d-8cae-643d3664f740)
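For reference, here is a minimal sketch of what that retry configuration looks like on a plain CDK `lambda.Function` (assuming v1-era CDK packages; `maxRetries` and `base` are the fields of CDK's `LogRetentionRetryOptions`, and the function name and asset path are illustrative):

```typescript
import * as cdk from "@aws-cdk/core";
import * as lambda from "@aws-cdk/aws-lambda";
import * as logs from "@aws-cdk/aws-logs";

// Sketch only: bumping the log-retention custom resource's retry budget.
// This works around the SDK throttling, but (as described above) can then
// surface the "conflicting operation" error instead.
new lambda.Function(this, "MyFn", {
  runtime: lambda.Runtime.NODEJS_14_X,
  handler: "index.handler",
  code: lambda.Code.fromAsset("src"),
  logRetention: logs.RetentionDays.THREE_MONTHS,
  logRetentionRetryOptions: {
    maxRetries: 100,
    base: cdk.Duration.millis(200), // backoff base between SDK retries
  },
});
```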

Solution 1

Create the log group manually for each Function:

this.getAllFunctions().forEach(func => {
	new logs.LogGroup(this, `${func.node.id}-LogGroup`, {
		logGroupName: `/aws/lambda/${func.functionName}`,
		retention: logs.RetentionDays.THREE_MONTHS,
		removalPolicy: scope.defaultRemovalPolicy
	})
})

The issue with this approach is that deployment fails if the log group already exists.

Solution 2

Understand what is causing the `A conflicting operation` error, and either come up with a more sensible `logRetentionRetryOptions` config, or rewrite the custom resource that sets log retention so it handles the conflict gracefully.
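To make Solution 2 concrete, here is a hedged sketch of what "handle the conflict gracefully" could look like: a generic wrapper that retries the CloudWatch Logs call with exponential backoff and full jitter whenever it reports a conflicting operation or throttling. The function name, option names, and error-matching strings are illustrative, not SST's actual custom resource code:

```typescript
// Illustrative sketch: retry an AWS SDK call on conflict/throttling errors.
type RetryOptions = { maxRetries: number; baseMs: number };

// CloudWatch Logs reports conflicts as OperationAbortedException with the
// message "A conflicting operation is currently in progress..."
const RETRYABLE_NAMES = ["OperationAbortedException", "ThrottlingException"];

async function retryOnConflict<T>(
  call: () => Promise<T>,
  { maxRetries, baseMs }: RetryOptions
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err: any) {
      const retryable =
        RETRYABLE_NAMES.includes(err?.name) ||
        /conflicting operation/i.test(err?.message ?? "");
      if (!retryable || attempt >= maxRetries) throw err;
      // Exponential backoff with full jitter spreads out concurrent retries.
      const delayMs = Math.random() * baseMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

The jitter matters here: the whole problem is many custom resource invocations hammering the same API at once, so retrying them all on the same schedule would just recreate the collision.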


Requests:
- https://serverless-stack.slack.com/archives/C01JG3B20RY/p1634839201037400
- https://serverless-stack.slack.com/archives/C01JG3B20RY/p1635375910361800
- Dan: https://discord.com/channels/983865673656705025/996097936440180767

fwang avatar Oct 21 '21 23:10 fwang

So I am looking into this issue, and I have created a GitHub repo to reproduce it: https://github.com/garretcharp/sst-log-groups

Basically, just create 50 Lambdas attached to an API, all with a log retention set, and the deploy will never succeed: the AWS custom resource attempts to create the CloudWatch log groups, and at least one of those calls throws a conflict error.

From a quick search of the error, it seems this happens specifically when trying to put the retention policy right after the log group has been created. My assumption is that AWS returns success on log group creation before the group is actually fully created, so setting the retention policy immediately afterwards occasionally results in conflict errors. That would mean it can happen at any scale, even with only 1 Lambda with a log retention set; it just doesn't always happen, so having more Lambdas greatly increases the chances.

I know other frameworks never use a custom resource to create log groups, which got me thinking there must be a way to do this without one, and there is. The only issue is that it would be a breaking change for everyone, since we would have to change `sst.Function` so that it creates, and depends on, a log group that we manage.

See this code snippet:

    this.getAllFunctions().forEach(fn => {
      const logicalId = this.getLogicalId(fn.node.defaultChild);

      const logGroup = new logs.LogGroup(this, `${logicalId}LogGroup`, {
        retention: logs.RetentionDays.ONE_WEEK,
        logGroupName: `/aws/lambda/${logicalId}`,
        removalPolicy: scope.defaultRemovalPolicy
      });

      fn.node.addDependency(logGroup);
    })

It is very similar to my previous snippet, but instead of just creating the log group, we tell the Lambda function that it depends on the log group. That means our log group is always created before the Lambda can be, removing the race where Lambda itself creates the log group. And unlike my last snippet, this also works when you redeploy a new version.

This would be a very simple addition to `sst.Function`: we just add the log group and make the function depend on it. It comes down to whether you want to go down that route, knowing how extreme a breaking change this is.

(And for anyone who wants to copy-paste this snippet: make sure you remove anything dealing with log retention elsewhere, either on the function definition or via the default function props. Note it also requires a fresh deploy, so if you have an existing deployment you must destroy it and redeploy.)

garretcharp avatar Oct 27 '21 23:10 garretcharp

@garretcharp thanks for looking into this. Afaik the function's logical ID isn't the function name. Can you double check if the Lambda function is writing to /aws/lambda/${logicalId}?

Can we catch the conflict error in the custom resource and retry?

fwang avatar Oct 27 '21 23:10 fwang

Hmm, might have been an oversight on my end on how the logicalId worked then. I was following a Stack Overflow answer that said not to reference the function to get the name, so that the Lambda doesn't end up depending on the log group, and that seemed to work for them. Not at my PC anymore, so I will have to look once I'm back.

I can definitely go the route of retrying on conflict error though since that won't cause a breaking change.

garretcharp avatar Oct 28 '21 00:10 garretcharp

Yeah, that might be the most reliable option.

fwang avatar Oct 28 '21 02:10 fwang

Hi @garretcharp, I tried the suggestion of adding a dependency from the function to the log group and hit a circular dependency issue when you have stacks adding routes to other stacks. Thought I'd just mention it in case it helps with what you're doing!

pa-dentally avatar Nov 04 '21 11:11 pa-dentally

Also, just doing Solution 1 didn't work for me. It got further on one deploy but still failed with the rate limit on one of the stacks. A combination of setting the `maxRetries` on the `defaultFunctionProps` and Solution 1 seemed to work, though:

logRetentionRetryOptions: { maxRetries: 100, },
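For anyone combining the two workarounds, this is roughly what the retry side looks like app-wide (a sketch assuming SST's `app.setDefaultFunctionProps` from that era, with `logRetentionRetryOptions` passed through to CDK; the `base` value is an assumption, not from the comment above):

```typescript
import * as cdk from "@aws-cdk/core";
import * as sst from "@serverless-stack/resources";

export default function main(app: sst.App): void {
  // Sketch: raise the log-retention custom resource's retry budget for
  // every Function in the app, alongside Solution 1's manually created
  // log groups.
  app.setDefaultFunctionProps({
    logRetentionRetryOptions: {
      maxRetries: 100,
      base: cdk.Duration.millis(200),
    },
  });
}
```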

pa-dentally avatar Nov 04 '21 11:11 pa-dentally

Now I've started hitting the duplicate name issue, so the workaround worked for a while but has now stopped... I'm looking into ways to check whether the log group exists beforehand. I thought `logs.LogGroup.fromLogGroupName` might work, but that seems to have been a bust.

pa-dentally avatar Nov 10 '21 18:11 pa-dentally

We are also running into this issue, sadly.

nnyegaard avatar Sep 05 '22 08:09 nnyegaard