opentelemetry-lambda
Observing delays in otel collector extension
Describe the bug
We are using the otel collector in decouple mode and observing that the extension takes > 2 s to start. On further investigation, we noticed a delay of ~1.5-2 s between the log statements
`Everything is ready. Begin running and processing data.` and `EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]`.
Is this much delay expected?
Hi @bhaskarbanerjee, is it possible to set the OTEL_LOG_LEVEL env var to true and share the debug logs with us?
Thanks for the quick response @serkan-ozal. Here it is: otel-lambda-collector.log
2024-08-26T13:39:55.810-05:00 | {"level":"warn","ts":1724697595.8106759,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
THIS IS WHERE WE SEE A DELAY OF ~2-4 seconds.
2024-08-26T13:39:57.228-05:00 | EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-26T13:39:57.229-05:00 | START RequestId: 6a59d6bb-e5ce-46d7-9b70-37069cf7f12a Version: $LATEST
Here is my config.
```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"
exporters:
  logging:
    loglevel: info
  otlphttp:
    endpoint: ${OTEL_GATEWAY_ENDPOINT}
    retry_on_failure:
      initial_interval: ${INITIAL_INTERVAL}
      max_interval: ${MAX_INTERVAL}
      max_elapsed_time: ${MAX_ELAPSED_TIME}
    timeout: ${TRANSPORT_TIMEOUT}
    sending_queue:
      queue_size: 5000
processors:
  decouple:
    max_queue_size: 5000
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, otlphttp]
      processors: [decouple]
    metrics:
      receivers: [otlp]
      exporters: [logging, otlphttp]
      processors: [decouple]
```
@bhaskarbanerjee
While I am not fully sure at the moment, I think there is Lambda runtime and function init (with OTEL SDK init) between the `Everything is ready. Begin running and processing data.` and `EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]` logs.
In my simple example, I have seen runtime and function init logs between those collector extension logs (~1.5 s).
To check this, you may also put print/debug log statements at the first and last lines (and at some other important points) of the handler.py file to measure the load start/end timings of your handler module/file (not the handler function, but the handler file), as sketched below.
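A minimal sketch of what such timing prints could look like (the module layout and the `lambda_handler` name here are assumptions, not your actual code):

```python
# handler.py - hypothetical sketch for measuring module-load time vs. handler time.
import time

_load_start = time.time()
print(f"[timing] handler.py load started at {_load_start:.3f}")

# ... your real imports / SDK setup would normally go here ...

print(f"[timing] handler.py load finished in {(time.time() - _load_start) * 1000:.0f} ms")


def lambda_handler(event, context):
    # Runs on each invoke; the module-level prints above run only during init.
    print("[timing] handler function invoked")
    return {"statusCode": 200}
```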
Thanks @serkan-ozal. I have other lambdas too which are not using OpenTelemetry, and in those cases there is no such wait time. Here is an example using the New Relic layer and another layer from HashiCorp in the same lambda.
2024-08-28T09:42:32.881-05:00 EXTENSION Name: newrelic-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-28T09:42:32.881-05:00 EXTENSION Name: vault-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-28T09:42:32.886-05:00 START RequestId: 81648590-a693-4abb-b0e4-26b0a067a0f8 Version: $LATEST
In this case, the Lambda lifecycle shows it took 5 ms from the extensions reaching the Ready state to the function start, which is quite acceptable.
@bhaskarbanerjee
Here are the logs from my sample Nodejs Lambda:
With OTEL SDK + User Handler Init
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7748003,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4317"}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7753932,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4318"}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.775681,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7757266,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:55:32.249Z Registering OpenTelemetry
...
2024-08-28T20:55:33.288Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
2024-08-28T20:55:33.290Z START RequestId: 5ef7f288-ba75-4637-b44b-3ae9cd4ccf13 Version: $LATEST
2024-08-28T20:55:33.296Z 2024-08-28T20:55:33.296Z 5ef7f288-ba75-4637-b44b-3ae9cd4ccf13 INFO Serving lambda request.
Time diff between collector logs: 1,513 ms
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.775681,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:55:33.288Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
With OTEL SDK + No User Handler Init
2024-08-28T20:33:34.018Z {"level":"info","ts":1724877214.018901,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"metrics","endpoint":"localhost:4317"}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.0191104,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"metrics","endpoint":"localhost:4318"}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.019237,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.0193567,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:33:34.447Z Registering OpenTelemetry
...
2024-08-28T20:33:34.617Z 2024-08-28T20:33:34.617Z undefined DEBUG @opentelemetry/instrumentation-http Applying instrumentation patch for nodejs core module on require hook { module: 'http' }
2024-08-28T20:33:34.623Z 2024-08-28T20:33:34.623Z undefined DEBUG @opentelemetry/instrumentation-aws-lambda Applying instrumentation patch for nodejs module file on require hook { module: '/var/task/index.js', version: '0.0.0', fileName: 'index', baseDir: '/var/task' }
2024-08-28T20:33:34.623Z 2024-08-28T20:33:34.623Z undefined DEBUG patch handler function
2024-08-28T20:33:34.637Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
2024-08-28T20:33:34.638Z START RequestId: 6fe1e4b8-c70c-4c31-9ff9-084fe62f8897 Version: $LATEST
2024-08-28T20:33:34.641Z 2024-08-28T20:33:34.641Z 6fe1e4b8-c70c-4c31-9ff9-084fe62f8897 INFO Serving lambda request.
Time diff between collector logs: 618 ms
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.019237,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:33:34.637Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
No OTEL SDK + No User Handler Init
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7347512,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4317"}
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7348504,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4318"}
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7349138,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:45:43.735Z {"level":"info","ts":1724877943.7349257,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:45:43.863Z EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-28T20:45:43.864Z START RequestId: 8e450b50-32d2-42aa-8f38-34ffebf46032 Version: $LATEST
2024-08-28T20:45:43.866Z 2024-08-28T20:45:43.866Z 8e450b50-32d2-42aa-8f38-34ffebf46032 INFO Serving lambda request.
Time diff between collector logs: 129 ms
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7349138,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:45:43.863Z EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
So, in summary, as you can see from the logs, the time diff between the collector logs depends heavily on the OTEL SDK init and user handler init delays. In your example, since the handler itself is just a simple hello world function without initializing any 3rd party dependency, I think the 2-4 second diff between the collector logs is caused by OTEL Python SDK initialization.
I will also rerun this benchmark with a sample Python Lambda and share the results. I am not a Python expert, but maybe this type of auto instrumentation is heavyweight for the Python runtime: https://github.com/open-telemetry/opentelemetry-lambda/blob/1634f8d4ebe8d9aad56f9bea9ea4932badcdde81/python/src/otel/otel_sdk/otel-instrument#L162
@serkan-ozal Many thanks, I really appreciate your help. qq - what config of the collector are you running with? Is it the default one?
I am gathering evidence on my end. Will share soon.
@bhaskarbanerjee default one
@bhaskarbanerjee I am not sure, but this type of auto instrumentation (it checks the dependencies and forks another process from the main one to run the real app) does not seem lightweight to me: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/opentelemetry-instrumentation/src/opentelemetry/instrumentation/auto_instrumentation/__init__.py
Maybe this is a question to be asked of the OTEL Python folks.
@bhaskarbanerjee I think my theory about the `EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]` log (which I mentioned in the SIG meeting) is correct. As far as I see, that log is not printed whenever the extension is ready, but when both the runtime and the function are initialized. To verify this, I think we need to add some more debug logs to the collector, especially right after it calls /next.
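For reference, a rough standalone sketch of the Extensions API flow being discussed (illustrative Python, not the collector extension's real Go implementation), using the documented register and event/next endpoints with timestamps around the blocking /next call:

```python
# Hypothetical sketch: register a Lambda extension, then block on /next.
# The gap before /next returns is where runtime + function init happens.
import json
import os
import time
import urllib.request

RUNTIME_API = os.environ["AWS_LAMBDA_RUNTIME_API"]
BASE = f"http://{RUNTIME_API}/2020-01-01/extension"

def register(name: str) -> str:
    req = urllib.request.Request(
        f"{BASE}/register",
        data=json.dumps({"events": ["INVOKE", "SHUTDOWN"]}).encode(),
        headers={"Lambda-Extension-Name": name, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers["Lambda-Extension-Identifier"]

def next_event(ext_id: str) -> dict:
    req = urllib.request.Request(
        f"{BASE}/event/next",
        headers={"Lambda-Extension-Identifier": ext_id},
    )
    with urllib.request.urlopen(req) as resp:  # blocks until INVOKE or SHUTDOWN
        return json.load(resp)

ext_id = register("debug-timing-extension")
print(f"[timing] registered, calling /next at {time.time():.3f}")
event = next_event(ext_id)
print(f"[timing] /next returned {event.get('eventType')} at {time.time():.3f}")
```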
Agreed on both your points. I have already added logs to find out more about several other inits and calls. I will share those tests after I run the scenarios we discussed this morning. But so far, with your results and mine, the culprit seems to be the SDK, especially AUTO instrumentation.
@serkan-ozal Here are my test results with a lambda auto instrumented in Node.js 18.
Conclusion same as yours.
| SDK Added | SDK Enabled | DecoupledProcessor | OTLPHTTP Exporter | Wait Time (ms) | Total Cold Start (ms) |
| --- | --- | --- | --- | --- | --- |
| ✖ | ✖ | ✅ | ✅ | 145 | 918 |
| ✅ | ✖ | ✅ | ✅ | 174 | 1030 |
| ✅ | ✅ | ✅ | ✅ | 1684 | 1740 |
| ✅ | ✖ | ✅ | ✖ | 184 | 1440 |
| ✖ | ✖ | ✅ | ✖ | 147 | 803 |
| ✖ | ✖ | ✖ | ✖ | 176 | 844 |
| ✖ | ✖ | ✖ | ✅ | 149 | 746 |
Inference:
- Addition of the SDK introduces high cold start time.
- Enabling the SDK in AUTO INSTRUMENTATION mode further increases the cold start.
Thanks for the detailed benchmark @bhaskarbanerjee.
I have also reproduced similar cold start delays with a Python example and verified that Python auto instrumentation is the reason. I will look into it further tomorrow by profiling the Python auto instrumentation script.
But initially, as far as I see from the Python auto instrumentation, it does the following (the discovery step is sketched below):
- starts the user handler in a child process
- scans all the Python paths on disk to discover auto instrumentation libraries (there are many)
- checks version conflicts between OTEL dependencies
- then initializes instrumenters along with their pre and post hooks
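For context, here is a rough, simplified sketch of the entry-point-based discovery that the second step refers to. This is not the actual opentelemetry-instrumentation code, which also performs the dependency/conflict checks and configuration mentioned above:

```python
# Simplified illustration of auto-discovery of instrumentors via entry points.
# Scanning installed distributions for this group touches package metadata
# across the whole Python path, which is where much of the coldstart cost comes from.
from importlib.metadata import entry_points  # Python 3.10+ signature used below

def load_all_instrumentors():
    for ep in entry_points(group="opentelemetry_instrumentor"):
        instrumentor_cls = ep.load()      # imports the instrumentation module
        instrumentor_cls().instrument()   # applies the monkeypatches

load_all_instrumentors()
```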
Autoinstrumentation is definitely a cold start penalty in JS. The more instrumentations are enabled, the higher the penalty. This is because we wrap require and intercept all module loads to determine if a module needs to be monkeypatched. It used to be much worse when each instrumentation maintained its own wrapper; now there is a single global wrapper that instrumentations register with. There may be additional optimizations that can be done, but the mechanism will always have some module load penalty.
Edit: tl;dr: because Node autoinstrumentation uses require-in-the-middle, each module load is slightly slower than it would be without autoinstrumentation enabled.
Thanks @dyladan. Should choosing specific instrumentations lower the cold starts? Ref: OTEL_NODE_RESOURCE_DETECTORS from https://opentelemetry.io/docs/zero-code/js/ and OTEL_PYTHON_DISABLED_INSTRUMENTATIONS from https://opentelemetry.io/docs/zero-code/python/configuration/#disabling-specific-instrumentations
Or are there any other ways, apart from manual instrumentation?
Do we have any tests conducted/documented by the community to determine the performance difference between AUTO instrumentation and MANUAL instrumentation?
@bhaskarbanerjee
For the Python runtime, I have analyzed the OTEL Python auto instrumentation script and verified that the following parts have a significant effect on coldstart (for a 512 MB Lambda function):
- auto discovery process: 750-1000 ms
- botocore instrumentation: 350-400 ms

I am not sure whether there is anything we can do for the botocore instrumentation, because the cost is caused by the import of some botocore modules by wrapt, which is the auto instrumentation library for Python.
But instead of auto discovery of the OTEL packages, I have switched to a static list of instrumentation packages, like we already do in the Node.js layer, and it helped me cut 750-1000 ms from the coldstart delay (a sketch of the idea is below).
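To illustrate the idea only (a hedged sketch, not the actual layer change; the exact set of instrumentations in the experimental layer may differ), a static list replaces the entry-point scan with explicit imports of known instrumentors:

```python
# Hypothetical static-list initialization: instrument only a fixed, known set of
# libraries instead of scanning the environment for every available instrumentor.
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.aws_lambda import AwsLambdaInstrumentor

def init_static_instrumentations():
    # Explicit list: no package-metadata scan, so the discovery cost is avoided.
    for instrumentor in (BotocoreInstrumentor(), AwsLambdaInstrumentor()):
        instrumentor.instrument()

init_static_instrumentations()
```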
If you are also interested in testing my optimized layer (which is experimental, so please don't use it in prod :) ), I can share its ARN so you can benchmark it too.
According to my coldstart analysis, I have created these issues in the OTEL Python repositories to discuss potential solutions:
- https://github.com/open-telemetry/opentelemetry-python-contrib/issues/2843
- https://github.com/open-telemetry/opentelemetry-python/issues/4171
Hi @serkan-ozal. Sorry, I was out for a few days.
Can you please share which instrumentations you have included? I have tried setting OTEL_PYTHON_DISABLED_INSTRUMENTATIONS to none, but it didn't yield much savings, so I would like to try out what you have tried.
> But instead of auto discovery of the OTEL packages, I have switched to static list of instrumentation packages like we already do in Node.js layer and it helped me to cut 750-1000 ms from coldstart delay.
This issue was marked stale. It will be closed in 30 days without additional activity.
Closed as inactive. Feel free to reopen if this issue is still relevant.