
Artificial limiting of Lambda Function creation

jaxxstorm opened this issue 3 years ago

What happened?

The underlying bridged provider artificially limits the number of lambda functions that can be uploaded in parallel.

The cause of this is https://github.com/hashicorp/terraform/issues/9364, and a fix was implemented upstream.

To resolve that issue, a mutex lock was added around the Lambda function upload, which serializes uploads and slows down the provisioning of Lambdas. You can see the change here:

https://github.com/hashicorp/terraform/pull/9667/files

Steps to reproduce

import pulumi
import pulumi_aws as aws

# Create 50 identically configured functions to observe the serialized uploads
for i in range(50):
    name = f'pulumi-{str(i).rjust(2, "0")}'
    aws.lambda_.Function(
        name,
        aws.lambda_.FunctionArgs(
            name=name,
            code=pulumi.FileArchive("hello-world.zip"),
            role="arn:aws:iam::616138583583:role/lbriggs-lambda-test",
            handler="hello-world.handler",
            runtime="nodejs12.x",
        ),
    )

Expected Behavior

It creates the functions in parallel

Actual Behavior

This takes around 4-5 minutes, and you can observe the creation slowing down and proceeding serially.

Output of pulumi about

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

jaxxstorm avatar Nov 09 '22 22:11 jaxxstorm

Note that this issue is tracking whether we should make a different decision than upstream on this behaviour, so it is not "awaiting-upstream".

The current locking incurs a very significant performance penalty in cases like the one above and provides no way to work around it. The original issue that this locking was introduced to fix is much rarer and is more within the user's control, since they can pass --parallel to limit parallelism (and thus concurrent memory usage).

lukehoban avatar Nov 13 '22 19:11 lukehoban

We may want to prioritize a decision here. #3740 reported the same issue; however, with SnapStart+Publish settings, code updates take much longer (minutes instead of seconds), so the whole UX gets much worse.

Note that the issue applies to Updates as well as Creates.

mikhailshilkov avatar Mar 27 '24 23:03 mikhailshilkov

I'm leaning towards making a different decision than upstream and consequently removing the serialization of Lambda uploads.

The underlying issue that was fixed upstream (https://github.com/hashicorp/terraform/issues/9364) only surfaces when dealing with many (10+) Lambdas in a memory-constrained environment. Additionally, that fix is ~8 years old, so it's worth checking whether the problem is still reproducible; for example, the Lambda resource in upstream has since been switched to the AWS SDK v2.

Apart from that, there are two user-facing workarounds that I can think of.

  1. The --parallel flag @lukehoban mentioned to control the number of parallel resource operations
  2. Uploading the Lambda code to S3 using aws.s3.BucketObjectv2 (see the sketch after this list). This is the mandatory approach for Lambdas over 50MB. Instead of loading the whole blob into memory, it streams the content to S3.
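
For reference, a minimal sketch of workaround 2, following the shape of the reproduction above; the bucket name, role ARN, and file names are placeholders. The archive is uploaded to S3 once via aws.s3.BucketObjectv2, and every function references it through s3_bucket/s3_key instead of an inline FileArchive:

import pulumi
import pulumi_aws as aws

# Bucket holding the Lambda artifacts (name is a placeholder)
artifact_bucket = aws.s3.BucketV2("lambda-artifacts")

# Upload the archive once; the provider streams the file to S3 instead of
# base64-encoding it into every CreateFunction request body
code_object = aws.s3.BucketObjectv2(
    "hello-world-code",
    bucket=artifact_bucket.bucket,
    key="hello-world.zip",
    source=pulumi.FileAsset("hello-world.zip"),
)

for i in range(50):
    name = f'pulumi-{str(i).rjust(2, "0")}'
    aws.lambda_.Function(
        name,
        aws.lambda_.FunctionArgs(
            name=name,
            s3_bucket=artifact_bucket.bucket,
            s3_key=code_object.key,
            role="arn:aws:iam::123456789012:role/your-lambda-role",  # placeholder
            handler="hello-world.handler",
            runtime="nodejs12.x",
        ),
    )

This can also be combined with workaround 1, e.g. pulumi up --parallel 8, to additionally cap the number of concurrent resource operations.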

I'm going to create a prototype that removes the artificial serialization and will test the mentioned workarounds for feasibility.

flostadler avatar May 22 '24 14:05 flostadler

After removing the artificial serialization I was able to verify that the underlying issue still exists and both workarounds mentioned above are viable.

This prototype PR includes the patch that removes the serialization as well as the necessary Pulumi programs to replicate the issue and test workarounds: https://github.com/pulumi/pulumi-aws/pull/3976

I'm testing with 25MB artifacts, which is half the 50MB maximum size for direct uploads. The tests create 50 Lambdas with those artifacts.

Findings

  • With default settings (--parallel 48) the AWS provider peaks at ~28GB of virtual memory (~3GB RSS, indicating heavy swapping) and fails after ~20min because the requests to AWS time out (Sigv4 signature expired).
  • When limiting parallel resource operations to 8, everything worked, but the memory usage of the AWS provider peaked at 2.8GB and pulumi up took ~15min.
  • When using the workaround of uploading the lambda.zip to S3, no limit on parallel resource operations was needed and pulumi up took ~4.5min. The memory usage of the AWS provider stayed below 300MB throughout the operation.

For reference, the baseline with the artificial limiting still in place took 26 minutes to complete pulumi up.

What's interesting is the memory usage when directly uploading the Lambda code. 50 Lambdas at 25MB each add up to 1.25GB, which is only roughly 5% of the observed memory usage. The same could be seen in the original upstream bug report: https://github.com/hashicorp/terraform/issues/9364. I'm gonna have a look into this as there might be some unnecessary memory allocations/copying going on.

flostadler avatar May 23 '24 10:05 flostadler

After profiling the provider with pprof I found two main places where memory is allocated.

awsRestjson1_serializeDocumentFunctionCode

This converts the zip archive to a base64-encoded string: https://github.com/aws/aws-sdk-go-v2/blob/main/service/lambda/serializers.go#L6178

In my test run with 50 Lambdas at 25MB each (1.25GB total), this allocated 3.49GB. That sounds reasonable (see the rough estimate after the list), given that it needs to:

  • copy the binary slice
  • convert it and store a new base64 string
  • create the request body with that base64 string
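
As a rough back-of-envelope check (my own estimate, not taken from the profile), the base64 string and the request body that embeds it already account for most of the reported allocations:

total_zip_gb = 50 * 25 / 1024     # ~1.22 GB of raw archive bytes across all functions
base64_gb = total_zip_gb * 4 / 3  # base64 inflates the payload by ~33% -> ~1.63 GB
body_gb = base64_gb               # request body embedding the base64 string
print(base64_gb + body_gb)        # ~3.26 GB, close to the 3.49 GB reported by pprof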

defaultRequestBodyLogger::Log

This is the upstream provider's HTTP logging middleware for the AWS SDK: https://github.com/hashicorp/aws-sdk-go-base/blob/main/logging/http.go#L118

It is used to emit debug logs with the request body here: https://github.com/hashicorp/aws-sdk-go-base/blob/main/logger.go#L140

Under the hood this creates multiple copies of the request body, which explains the high memory usage.

Even though this log message is only emitted when debug logging is turned on, the code for preparing the message is executed regardless of the log level.
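
For illustration only (a simplified Python sketch, not the actual Go middleware), this is the general anti-pattern: the expensive body copy happens before the log level is ever consulted, whereas guarding on the level first would avoid it entirely.

import logging

logger = logging.getLogger("aws-http")

def log_request_eager(body: bytes) -> None:
    # The multi-MB body is decoded (copied) up front, even if DEBUG is disabled
    pretty = body.decode("utf-8", errors="replace")
    logger.debug("request body: %s", pretty)

def log_request_lazy(body: bytes) -> None:
    # Checking the level first skips the copy when debug logging is off
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("request body: %s", body.decode("utf-8", errors="replace"))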

This is my main suspicion right now. I'm gonna test whether removing the lambda binary from the log fixes the problem.

flostadler avatar May 23 '24 14:05 flostadler

After removing the request body logging for Lambda, the issue is fixed!

To compare the effects of removing the artificial limiting, I've executed a set of tests on a c6i.2xlarge EC2 instance (each test was executed 5 times; measurements are the maximum observed). The EC2 instance does not have swap enabled (the default):

1. Baseline with artificial limiting (pulumi-aws 6.37.0). Default --parallel of 32

pulumi up takes ~8.5 minutes and memory peaks at ~0.9GB RSS.

2. Without artificial limiting (https://github.com/pulumi/pulumi-aws/pull/3976/commits/d8152272c49132cd7839aa16b179eea3178c8500). Default --parallel of 32

pulumi up takes ~2 minutes and memory peaks at ~7.1GB RSS. This is very interesting compared to the 28GB memory usage on my M3 MacBook mentioned before.

My current working theory is that this is driven by the lower network bandwidth locally in combination with retries. The Sigv4 signature of an AWS API call expires if the call takes longer than 5 minutes. When that happens, the SDK retries the call and the HTTP request logger copies and logs the body again, leading to increased memory usage.

3. Without artificial limiting (https://github.com/pulumi/pulumi-aws/pull/3976/commits/d8152272c49132cd7839aa16b179eea3178c8500). --parallel of 48, bandwidth limited to 25Mbit/s up and 50Mbit/s down

pulumi up fails after ~19 minutes and consumes 14.8GB RSS (93% of the instance's memory). The errors are Sigv4 signature expired. This is in line with the behavior locally on my M3 Mac and shows that the memory issues around HTTP request logging get amplified by retries, which proves the theory in 2).

4. Without artificial limiting and Lambda HTTP body logging turned off (https://github.com/pulumi/pulumi-aws/pull/3976/commits/0c4044c811d667c3283f4d60afefaa3e054c8471). Default --parallel of 32

pulumi up takes ~1:30 minutes and memory peaks at 4.1GB RSS.

5. Without artificial limiting and Lambda HTTP body logging turned off (https://github.com/pulumi/pulumi-aws/pull/3976/commits/0c4044c811d667c3283f4d60afefaa3e054c8471). --parallel 8

pulumi up takes ~1:50 minutes and memory peaks at 1.2GB RSS.

flostadler avatar May 24 '24 11:05 flostadler

So to summarize the findings:

  • The high memory usage during Lambda creation/updates was indeed caused by logging the request bodies, as this involves multiple copies and memory allocations.
  • Removing the artificial serialization of Lambda creation/updates does not lead to unusual memory consumption once the request body logging is removed. The memory usage is in line with what's expected for assembling the AWS API requests that directly upload Lambda code (converting the archive to base64 and building the request body from it).
  • Using --parallel or uploading the Lambda code to S3 gives users control over the memory usage.

Based on those findings, I'm proposing to fix the HTTP request logger and remove the artificial limiting of Lambda creation/updates. This should result in a drastic speed-up for users with many Lambdas: my tests have shown a ~5.5x speed-up for directly uploading 25MB Lambda archives. Users with smaller archives, or users that upload the archives to S3, should see even bigger speed-ups.

flostadler avatar May 24 '24 12:05 flostadler

I was able to create a patch that fixes the memory issues: https://github.com/pulumi/pulumi-aws/pull/3976

I'm drafting upstream PRs right now, but I'm proposing that we still go forward with the patch for the time being because this change touches two different upstream repos (hashicorp/aws-sdk-go-base & hashicorp/terraform-provider-aws). My hunch is that it will take a substantial amount of time to get merged (even small patches for panics take weeks). If it goes faster, we can remove the patch again.

@t0yv0 @mjeffryes @corymhall What do you think about that approach?

flostadler avatar May 31 '24 16:05 flostadler

Great work! I'd patch and release first and then propose the upstream PRs; is there any reason to keep our users waiting? One reason would be if there is uncertainty or risk in the patch. Looking forward to reviewing this next week.

t0yv0 avatar May 31 '24 16:05 t0yv0

I'd say the main risk with the patch is the maintenance effort. Given that the patch is fairly isolated (it only touches the AWS SDK middleware), I'd rather not keep users waiting any longer.

flostadler avatar Jun 04 '24 15:06 flostadler

I'd say the main risk with the patch is the maintenance effort.

If that becomes a problem we can revert, right? I'm for merging.

t0yv0 avatar Jun 04 '24 17:06 t0yv0

This was fixed and released as part of v6.39.0.

flostadler avatar Jun 11 '24 09:06 flostadler