opentelemetry-lambda
Observing delays in otel collector extension
Describe the bug
We are using the otel collector in decouple mode and observing that the extension takes > 2 s to start. On further investigation, we noticed a delay of ~1.5-2 s between the log statements
`Everything is ready. Begin running and processing data.` and `EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]`.
Is this much delay expected?
Hi @bhaskarbanerjee, is it possible to set the OTEL_LOG_LEVEL env var to true and share the debug logs with us?
Thanks for the quick response @serkan-ozal. Here it is: otel-lambda-collector.log
2024-08-26T13:39:55.810-05:00 | {"level":"warn","ts":1724697595.8106759,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
THIS IS WHERE WE SEE A DELAY OF ~2-4 seconds.
2024-08-26T13:39:57.228-05:00 | EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-26T13:39:57.229-05:00 | START RequestId: 6a59d6bb-e5ce-46d7-9b70-37069cf7f12a Version: $LATEST
Here is my config.
```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"
exporters:
  logging:
    loglevel: info
  otlphttp:
    endpoint: ${OTEL_GATEWAY_ENDPOINT}
    retry_on_failure:
      initial_interval: ${INITIAL_INTERVAL}
      max_interval: ${MAX_INTERVAL}
      max_elapsed_time: ${MAX_ELAPSED_TIME}
    timeout: ${TRANSPORT_TIMEOUT}
    sending_queue:
      queue_size: 5000
processors:
  decouple:
    max_queue_size: 5000
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, otlphttp]
      processors: [decouple]
    metrics:
      receivers: [otlp]
      exporters: [logging, otlphttp]
      processors: [decouple]
```
@bhaskarbanerjee
While I am not fully sure at the moment, I think there is Lambda runtime and function init (with OTEL SDK init) between the `Everything is ready. Begin running and processing data.` and `EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]` logs.
In my simple example, I have seen runtime and function init logs between those collector extension logs (~1.5 s).
To check this, you may also put print/debug log statements at the first and last lines (and at some other important points) of the handler.py file to measure the load start/end timings of your handler module/file (not the handler function, but the handler file), as sketched below.
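A minimal sketch of what such timing prints could look like (the module layout and the `lambda_handler` name here are assumptions, not your actual code):

```python
# handler.py - hypothetical sketch for measuring module-load time vs. handler time.
import time

_load_start = time.time()
print(f"[timing] handler.py load started at {_load_start:.3f}")

# ... your real imports / SDK setup would normally go here ...

print(f"[timing] handler.py load finished in {(time.time() - _load_start) * 1000:.0f} ms")


def lambda_handler(event, context):
    # Runs on each invoke; the module-level prints above run only during init.
    print("[timing] handler function invoked")
    return {"statusCode": 200}
```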
Thanks @serkan-ozal. I have other lambdas too which are not using OpenTelemetry, and in those cases there is no such wait time. Here is an example using the New Relic layer and another layer from HashiCorp in the same lambda.
2024-08-28T09:42:32.881-05:00 EXTENSION Name: newrelic-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-28T09:42:32.881-05:00 EXTENSION Name: vault-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-28T09:42:32.886-05:00 START RequestId: 81648590-a693-4abb-b0e4-26b0a067a0f8 Version: $LATEST
In this case, the Lambda lifecycle shows it took 5 ms from the extensions reaching the Ready state to the function start, which is quite acceptable.
@bhaskarbanerjee
Here are the logs from my sample Nodejs Lambda:
With OTEL SDK + User Handler Init
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7748003,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4317"}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7753932,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4318"}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.775681,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7757266,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:55:32.249Z Registering OpenTelemetry
...
2024-08-28T20:55:33.288Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
2024-08-28T20:55:33.290Z START RequestId: 5ef7f288-ba75-4637-b44b-3ae9cd4ccf13 Version: $LATEST
2024-08-28T20:55:33.296Z 2024-08-28T20:55:33.296Z 5ef7f288-ba75-4637-b44b-3ae9cd4ccf13 INFO Serving lambda request.
Time diff between collector logs: 1,513 ms
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.775681,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:55:33.288Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
With OTEL SDK + No User Handler Init
2024-08-28T20:33:34.018Z {"level":"info","ts":1724877214.018901,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"metrics","endpoint":"localhost:4317"}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.0191104,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"metrics","endpoint":"localhost:4318"}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.019237,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.0193567,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:33:34.447Z Registering OpenTelemetry
...
2024-08-28T20:33:34.617Z 2024-08-28T20:33:34.617Z undefined DEBUG @opentelemetry/instrumentation-http Applying instrumentation patch for nodejs core module on require hook { module: 'http' }
2024-08-28T20:33:34.623Z 2024-08-28T20:33:34.623Z undefined DEBUG @opentelemetry/instrumentation-aws-lambda Applying instrumentation patch for nodejs module file on require hook { module: '/var/task/index.js', version: '0.0.0', fileName: 'index', baseDir: '/var/task' }
2024-08-28T20:33:34.623Z 2024-08-28T20:33:34.623Z undefined DEBUG patch handler function
2024-08-28T20:33:34.637Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
2024-08-28T20:33:34.638Z START RequestId: 6fe1e4b8-c70c-4c31-9ff9-084fe62f8897 Version: $LATEST
2024-08-28T20:33:34.641Z 2024-08-28T20:33:34.641Z 6fe1e4b8-c70c-4c31-9ff9-084fe62f8897 INFO Serving lambda request.
Time diff between collector logs: 618 ms
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.019237,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:33:34.637Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
No OTEL SDK + No User Handler Init
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7347512,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4317"}
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7348504,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4318"}
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7349138,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:45:43.735Z {"level":"info","ts":1724877943.7349257,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:45:43.863Z EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-28T20:45:43.864Z START RequestId: 8e450b50-32d2-42aa-8f38-34ffebf46032 Version: $LATEST
2024-08-28T20:45:43.866Z 2024-08-28T20:45:43.866Z 8e450b50-32d2-42aa-8f38-34ffebf46032 INFO Serving lambda request.
Time diff between collector logs: 129 ms
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7349138,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:45:43.863Z EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
So, in summary, as you can see from the logs, the time diff between the collector logs depends heavily on the OTEL SDK init and user handler init delays. In your example, since the handler itself is just a simple hello world function without initializing any 3rd party dependency, I think the 2-4 second diff between the collector logs is caused by OTEL Python SDK initialization.
I will also rerun this benchmark with a sample Python Lambda and share the results. I am not a Python expert, but maybe this type of auto instrumentation is heavyweight for the Python runtime: https://github.com/open-telemetry/opentelemetry-lambda/blob/1634f8d4ebe8d9aad56f9bea9ea4932badcdde81/python/src/otel/otel_sdk/otel-instrument#L162
@serkan-ozal Many thanks, I really appreciate your help. qq - what config of the collector are you running with? Is it the default one?
I am gathering evidence on my end. Will share soon.
@bhaskarbanerjee default one
@bhaskarbanerjee I am not sure, but this type of auto instrumentation (it checks the dependencies and forks another process from the main one to run the real app) does not seem lightweight to me: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/opentelemetry-instrumentation/src/opentelemetry/instrumentation/auto_instrumentation/__init__.py
Maybe this is a question to be asked of the OTEL Python folks.
@bhaskarbanerjee I think my theory about the `EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]` log (which I mentioned in the SIG meeting) is correct. As far as I see, that log is not printed whenever the extension is ready, but when both the runtime and the function are initialized. To verify this, I think we need to add some more debug logs to the collector, especially right after it calls /next.
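For reference, a rough standalone sketch of the Extensions API flow being discussed (illustrative Python, not the collector extension's real Go implementation), using the documented register and event/next endpoints with timestamps around the blocking /next call:

```python
# Hypothetical sketch: register a Lambda extension, then block on /next.
# The gap before /next returns is where runtime + function init happens.
import json
import os
import time
import urllib.request

RUNTIME_API = os.environ["AWS_LAMBDA_RUNTIME_API"]
BASE = f"http://{RUNTIME_API}/2020-01-01/extension"

def register(name: str) -> str:
    req = urllib.request.Request(
        f"{BASE}/register",
        data=json.dumps({"events": ["INVOKE", "SHUTDOWN"]}).encode(),
        headers={"Lambda-Extension-Name": name, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers["Lambda-Extension-Identifier"]

def next_event(ext_id: str) -> dict:
    req = urllib.request.Request(
        f"{BASE}/event/next",
        headers={"Lambda-Extension-Identifier": ext_id},
    )
    with urllib.request.urlopen(req) as resp:  # blocks until INVOKE or SHUTDOWN
        return json.load(resp)

ext_id = register("debug-timing-extension")
print(f"[timing] registered, calling /next at {time.time():.3f}")
event = next_event(ext_id)
print(f"[timing] /next returned {event.get('eventType')} at {time.time():.3f}")
```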
Agreed on both your points. I have already added logs to find out more about several other inits and calls. I will share those tests after I run the scenarios we discussed this morning. But so far, with your results and mine, the culprit seems to be the SDK, especially AUTO instrumentation.
@serkan-ozal Here are my test results with a lambda auto instrumented in Node.js 18.
Conclusion same as yours.
| SDK Added | SDK Enabled | DecoupledProcessor | OTLPHTTP Exporter | Wait Time (ms) | Total Cold Start (ms) |
| --- | --- | --- | --- | --- | --- |
| ✖ | ✖ | ✅ | ✅ | 145 | 918 |
| ✅ | ✖ | ✅ | ✅ | 174 | 1030 |
| ✅ | ✅ | ✅ | ✅ | 1684 | 1740 |
| ✅ | ✖ | ✅ | ✖ | 184 | 1440 |
| ✖ | ✖ | ✅ | ✖ | 147 | 803 |
| ✖ | ✖ | ✖ | ✖ | 176 | 844 |
| ✖ | ✖ | ✖ | ✅ | 149 | 746 |
Inference:
- Addition of the SDK introduces high cold start time.
- Enabling the SDK in AUTO INSTRUMENTATION mode further increases the cold start.
Thanks for the detailed benchmark @bhaskarbanerjee.
I have also reproduced similar cold start delays with a Python example and verified that Python auto instrumentation is the reason. I will look into it further tomorrow by profiling the Python auto instrumentation script.
But initially, as far as I see from the Python auto instrumentation, it does the following (the discovery step is sketched below):
- starts the user handler in a child process
- scans all the Python paths on disk to discover auto instrumentation libraries (there are many)
- checks version conflicts between OTEL dependencies
- then initializes instrumenters along with their pre and post hooks
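For context, here is a rough, simplified sketch of the entry-point-based discovery that the second step refers to. This is not the actual opentelemetry-instrumentation code, which also performs the dependency/conflict checks and configuration mentioned above:

```python
# Simplified illustration of auto-discovery of instrumentors via entry points.
# Scanning installed distributions for this group touches package metadata
# across the whole Python path, which is where much of the coldstart cost comes from.
from importlib.metadata import entry_points  # Python 3.10+ signature used below

def load_all_instrumentors():
    for ep in entry_points(group="opentelemetry_instrumentor"):
        instrumentor_cls = ep.load()      # imports the instrumentation module
        instrumentor_cls().instrument()   # applies the monkeypatches

load_all_instrumentors()
```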
Autoinstrumentation is definitely a cold start penalty in JS. The more instrumentations are enabled, the higher the penalty. This is because we wrap require and intercept all module loads to determine if a module needs to be monkeypatched. It used to be much worse when each instrumentation maintained its own wrapper; now there is a single global wrapper that instrumentations register with. There may be additional optimizations that can be done, but the mechanism will always have some module load penalty.
Edit: tl;dr: because Node autoinstrumentation uses require-in-the-middle, each module load is slightly slower than it would be without autoinstrumentation enabled.
Thanks @dyladan. Should choosing specific instrumentations lower the cold starts? Ref: OTEL_NODE_RESOURCE_DETECTORS from https://opentelemetry.io/docs/zero-code/js/ and OTEL_PYTHON_DISABLED_INSTRUMENTATIONS from https://opentelemetry.io/docs/zero-code/python/configuration/#disabling-specific-instrumentations
Or are there any other ways, apart from manual instrumentation?
Do we have any tests conducted/documented by the community to determine the performance difference between AUTO instrumentation and MANUAL instrumentation?
@bhaskarbanerjee
For the Python runtime, I have analyzed the OTEL Python auto instrumentation script and verified that the following parts have a significant effect on coldstart (for a 512 MB Lambda function):
- auto discovery process: 750-1000 ms
- botocore instrumentation: 350-400 ms

I am not sure whether there is anything we can do for the botocore instrumentation, because the cost is caused by the import of some botocore modules by wrapt, which is the auto instrumentation library for Python.
But instead of auto discovery of the OTEL packages, I have switched to a static list of instrumentation packages, like we already do in the Node.js layer, and it helped me cut 750-1000 ms from the coldstart delay (a sketch of the idea is below).
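To illustrate the idea only (a hedged sketch, not the actual layer change; the exact set of instrumentations in the experimental layer may differ), a static list replaces the entry-point scan with explicit imports of known instrumentors:

```python
# Hypothetical static-list initialization: instrument only a fixed, known set of
# libraries instead of scanning the environment for every available instrumentor.
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.aws_lambda import AwsLambdaInstrumentor

def init_static_instrumentations():
    # Explicit list: no package-metadata scan, so the discovery cost is avoided.
    for instrumentor in (BotocoreInstrumentor(), AwsLambdaInstrumentor()):
        instrumentor.instrument()

init_static_instrumentations()
```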
If you are also interested in testing my optimized layer (which is experimental, so please don't use it in prod :) ), I can share its ARN so you can benchmark it too.
According to my coldstart analysis, I have created these issues in the OTEL Python repositories to discuss potential solutions:
- https://github.com/open-telemetry/opentelemetry-python-contrib/issues/2843
- https://github.com/open-telemetry/opentelemetry-python/issues/4171
Hi @serkan-ozal. Sorry, I was out for a few days.
Can you please share which instrumentations you have included? I have tried setting OTEL_PYTHON_DISABLED_INSTRUMENTATIONS to none, but it didn't yield much savings, so I would like to try out what you have tried.
> But instead of auto discovery of the OTEL packages, I have switched to static list of instrumentation packages like we already do in Node.js layer and it helped me to cut 750-1000 ms from coldstart delay.
This issue was marked stale. It will be closed in 30 days without additional activity.
Closed as inactive. Feel free to reopen if this issue is still relevant.