Collector extension layer exits and aborts execution when a container-image Lambda is run in a local container
Describe the bug
I am trying to use the OpenTelemetry Lambda layers in a container image. While this functionality is not currently documented, it seems reasonable to expect it to work, since for ZIP file distributions the Lambda layers are simply unzipped into the /opt directory.
The OpenTelemetry Lambda Collector extension fails when running a Lambda container locally with Docker and invoking it with an event. The Lambda aborts execution and the Collector fails with the following error:
```
{"level":"fatal","ts":1749391498.7609231,"msg":"Cannot start Telemetry API Listener","error":"failed to find available port: listen tcp: lookup sandbox.localdomain on 127.0.0.11:53: no such host"}
```
Steps to reproduce
I have created a repository to reproduce the issue at https://github.com/gotgenes/lambda-opentelemetry-docker.
The repository includes the following:
- A Node.js Lambda function implemented in TypeScript.
- A Dockerfile to build the Lambda container image from the Node.js v22 Lambda base image with the OpenTelemetry Lambda layers.
- A Docker Compose file to run the Lambda container locally with Docker, along with an otel-tui sidecar container to view the telemetry.
- A CDK app to create the ECR repository and deploy the Lambda function.
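For context, the image-build approach can be sketched roughly as below. This is a simplified illustration, not the actual Dockerfile from the repository; the layer directory names are assumptions, and the layer zips are assumed to have been downloaded and extracted beforehand:

```dockerfile
# Sketch only; paths and layer names are illustrative.
FROM public.ecr.aws/lambda/nodejs:22

# Lambda layers for ZIP functions are unzipped into /opt, so for a
# container image we copy the extracted layer contents there ourselves.
COPY layers/collector/ /opt/
COPY layers/otel-nodejs/ /opt/

# Compiled TypeScript handler code
COPY dist/ ${LAMBDA_TASK_ROOT}/
CMD ["index.handler"]
```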
The steps to reproduce the issue are as follows:

1. Clone the repository.
2. Set the environment variables:

   ```shell
   export AWS_PROFILE=$YOUR_PROFILE
   export COMPOSE_BAKE=true
   ```

3. Build and start the Lambda container with Docker Compose:

   ```shell
   docker compose up --build
   ```

4. In a separate terminal session, invoke the Lambda function:

   ```shell
   curl -XPOST -i -d '{}' http://localhost:9000/2015-03-31/functions/function/invocations
   ```

5. Observe the logs in the terminal where Docker Compose is running.
Please see the repository's README for detailed instructions.
What did you expect to see?
I expected the OpenTelemetry Lambda Collector extension to start successfully and the Lambda function to execute without errors, so that trace data would appear in the otel-tui sidecar container and the curl command would receive a successful response.
What did you see instead?
I get a 502 response from the curl command, and the following logs in the terminal where Docker Compose is running:
```
2025-06-08 10:01:57.408 | 08 Jun 2025 14:01:57,408 [INFO] (rapid) exec '/var/runtime/bootstrap' (cwd=/var/task, handler=)
2025-06-08 10:04:58.719 | 08 Jun 2025 14:04:58,719 [INFO] (rapid) INIT START(type: on-demand, phase: init)
2025-06-08 10:04:58.719 | START RequestId: d2de0040-511a-4348-bf8f-b590e057ee0c Version: $LATEST
2025-06-08 10:04:58.752 | {"level":"info","ts":1749391498.7526667,"msg":"Launching OpenTelemetry Lambda extension","version":"v0.126.0"}
2025-06-08 10:04:58.754 | 08 Jun 2025 14:04:58,754 [INFO] (rapid) External agent collector (3dcc14c5-3f8b-4004-b7c5-556b5deea231) registered, subscribed to [INVOKE SHUTDOWN]
2025-06-08 10:04:58.761 | {"level":"fatal","ts":1749391498.7609231,"msg":"Cannot start Telemetry API Listener","error":"failed to find available port: listen tcp: lookup sandbox.localdomain on 127.0.0.11:53: no such host"}
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [WARNING] (rapid) First fatal error stored in appctx: Extension.Crash
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [WARNING] (rapid) Process extension-collector-1 exited: exit status 1
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [INFO] (rapid) INIT RTDONE(status: error)
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [INFO] (rapid) INIT REPORT(durationMs: 42.785000)
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [ERROR] (rapid) Init failed error=exit status 1 InvokeID=
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [WARNING] (rapid) Shutdown initiated: spindown
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [INFO] (rapid) Waiting for runtime domain processes termination
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [INFO] (rapid) INIT START(type: on-demand, phase: invoke)
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [INFO] (rapid) INIT REPORT(durationMs: 0.051000)
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [INFO] (rapid) INVOKE START(requestId: bdacc1f2-3764-4f4b-b6af-7fa918202024)
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [ERROR] (rapid) Invoke failed error=ErrAgentNameCollision InvokeID=bdacc1f2-3764-4f4b-b6af-7fa918202024
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [ERROR] (rapid) Invoke DONE failed: Sandbox.Failure
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [WARNING] (rapid) Reset initiated: ReleaseFail
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [WARNING] (rapid) The runtime was not started.
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [WARNING] (rapid) Agent collector (3dcc14c5-3f8b-4004-b7c5-556b5deea231) failed to launch, therefore skipping shutting it down.
2025-06-08 10:04:58.762 | 08 Jun 2025 14:04:58,762 [INFO] (rapid) Waiting for runtime domain processes termination
2025-06-08 10:07:26.750 | 08 Jun 2025 14:07:26,750 [INFO] (rapid) Received signal signal=terminated
2025-06-08 10:07:26.750 | 08 Jun 2025 14:07:26,750 [INFO] (rapid) Shutting down...
2025-06-08 10:07:26.750 | 08 Jun 2025 14:07:26,750 [WARNING] (rapid) Reset initiated: SandboxTerminated
2025-06-08 10:07:26.750 | 08 Jun 2025 14:07:26,750 [INFO] (rapid) Waiting for runtime domain processes termination
```
What version of collector/language SDK version did you use?
- Collector extension layer version: v0.15.0
- Node.js layer version: v0.14.0
What language layer did you use?
JavaScript/Node.js (implemented in TypeScript)
Additional context
While I appreciate that this project provides extension layers for the collector and language SDKs that ZIP file distributions can use, AWS seems to be pushing users toward container images for Lambda functions. It would therefore be beneficial to support the OpenTelemetry Lambda layers in container images as well, either by publishing base images with the OpenTelemetry Lambda layers included, or at least by documenting how to use the layers in a custom Dockerfile. My repository is intended to demonstrate the latter approach, but it currently does not work.
The local setup does not have the Lambda Runtime API available. Tools like AWS SAM use the AWS Lambda Runtime Interface Emulator for local setups: https://github.com/aws/aws-lambda-runtime-interface-emulator
The Collector tries to register with the Lambda Extensions API (https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html), which is not available locally.
The endpoint of the Extensions API can be configured using the AWS_LAMBDA_RUNTIME_API environment variable:
https://github.com/open-telemetry/opentelemetry-lambda/blob/759d5792893e9169f03c905dedc96aa10ed234a3/collector/internal/telemetryapi/client.go#L48C64-L48C86
Yes, the AWS Lambda RIE doesn't support the Telemetry API yet: https://github.com/aws/aws-lambda-runtime-interface-emulator/issues/94
The good thing is that the OTEL Lambda Collector extension doesn't have a hard dependency on the Telemetry API. Therefore, I think we should introduce a configuration option to disable Telemetry API support in the OTEL Lambda Collector extension (enabled by default).
Per our discussion in the FaaS SIG meeting, @maxday will be owning this issue.
@maxday I am not able to assign this issue, as your name does not show up in the Assignees section.
@serkan-ozal According to the doc (ref: https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/assigning-issues-and-pull-requests-to-other-github-users#about-issue-and-pull-request-assignees) I would need read permission (as I am already a member of the opentelemetry org). So I think you need to add me as a collaborator with read access. Let me know!
@RaphaelManke @serkan-ozal @maxday Thank you so much for taking the time to look into this issue.
I found a way to run locally with the instrumentation layer only (exporting directly to an otel-tui sidecar container); however, the more consistent I can keep local and production, the better. It sounds like having a way to skip the Telemetry API would help users run the collector extension locally.
Am I understanding correctly that, even once the proposed Telemetry API-skip gets implemented, users would still need to include the runtime interface emulator that @RaphaelManke recommended in order to manage the extension lifecycle when executing locally?
Hey @maxday - is there any progress on this issue, or a workaround? As far as I can see, incorporating the OpenTelemetry Lambda layer in a container image effectively prevents running the container locally.
Essentially, I'm hoping for a config switch that removes the dependency on the Telemetry API.