opentelemetry-lambda
Improving Lambda cold start
Is your feature request related to a problem? Please describe.
A Lambda cold start happens when a new instance of a Lambda function must be created and initialized. The cold start is the delay that this initialization adds between the invocation and the start of the function's execution. New instances need to be initialized whenever existing instances have expired due to inactivity or when there are more concurrent invocations than active instances. Cold starts are an inherent problem with Lambda functions because it is not possible to keep a function initialized forever.
The OpenTelemetry SDK was not created with Lambda functions in mind. If you use OpenTelemetry inside a Lambda function, the overhead of initializing the SDK and, optionally, auto-instrumenting the application code adds to the cold start time. This is especially painful for users because it adds latency to their application and increases the cost of running the functions.
Describe the solution you'd like
This proposal will tackle the cold start time of the OpenTelemetry lambda layers with the following plan:
Plan:
- Continuously measure the cold start time of the layers with each release. This will help catch performance regressions, show trends, and indicate where we should invest our time.
- Profile each layer to identify where the initialization time is spent in the code.
- Propose optimizations to the initialization of the SDK in each layer: using the profiling information from the previous step, look for low-hanging fruit as well as more complex refactorings that will improve performance.
Methodology for measuring the cold startup:
- Measure the cold start time for lambdas with and without the layers for each supported layer.
- Create a sample application and deploy it to a Lambda function.
- Generate load for this sample application.
- Change a parameter of the Lambda function to force its instances to be recreated.
- Parse the logs of the lambda function with the following query:
filter @type="REPORT"
| filter ispresent(@initDuration)
| stats count(@initDuration) as coldStartCount, pct(@initDuration, 50) as p50Init, pct(@initDuration, 90) as p90Init, pct(@initDuration, 99) as p99Init group by @log
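The measurement steps above could be scripted roughly as follows. This is only a minimal sketch under stated assumptions, not the project's tooling: `lambda_client` is assumed to be a boto3 Lambda client created by the caller, `COLD_START_NONCE` is a hypothetical environment variable name, and the log lines are assumed to be the plain-text `REPORT` lines that Lambda writes to CloudWatch. The percentile uses the nearest-rank method, so the numbers may differ slightly from what `pct()` returns in the query.

```python
import math
import re
import time

# "Init Duration" only appears on the REPORT line of an invocation that had a cold start.
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def force_cold_start(lambda_client, function_name):
    """Change an environment variable so later invocations use new instances.

    `lambda_client` is assumed to be boto3.client("lambda").
    COLD_START_NONCE is a hypothetical variable name used only in this sketch.
    """
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    # Copy the existing variables: the update call replaces the whole map.
    variables = dict(config.get("Environment", {}).get("Variables", {}))
    variables["COLD_START_NONCE"] = str(time.time_ns())
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables


def init_durations(log_lines):
    """Return the init durations (ms) found on REPORT lines."""
    durations = []
    for line in log_lines:
        if not line.strip().startswith("REPORT"):
            continue
        match = INIT_RE.search(line)
        if match:
            durations.append(float(match.group(1)))
    return durations


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no cold starts found")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(log_lines):
    """Mirror the fields produced by the query: count, p50, p90, p99."""
    durations = init_durations(log_lines)
    return {
        "coldStartCount": len(durations),
        "p50Init": percentile(durations, 50),
        "p90Init": percentile(durations, 90),
        "p99Init": percentile(durations, 99),
    }
```

A measurement run would then alternate `force_cold_start(...)` with a burst of invocations and pass the collected log lines to `summarize(...)`, once for the function with the layer and once without it.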
Methodology for profiling the lambda functions:
- TBD - We will need to
Additional context
References
https://github.com/open-telemetry/opentelemetry-lambda/issues/263
Cool! Glad we're addressing this issue. :-)
Hi, is there a timeline, or are there any proposals planned or in place, to address the issues shown by the cold start metrics being recorded?