opentelemetry-lambda
ADOT collector sends response to Runtime API instead of waiting for traces to be sent
Hi!
After a discussion on Twitter, I realized that the ADOT collector extension doesn't wait for data to be sent before making the next call to the Lambda Runtime API.
Here's the section of the code that loops on next events: https://github.com/open-telemetry/opentelemetry-lambda/blob/main/collector/main.go#L71
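For reference, the loop in that file looks roughly like the following sketch (a simplified paraphrase with made-up identifiers, not the exact source; see the link above for the real code):
package sketch

import "context"

// runtimeEvent and extensionAPIClient are placeholders for the extension API
// client types used in this repo; the names here are illustrative only.
type runtimeEvent struct{ EventType string }

type extensionAPIClient interface {
	NextEvent(ctx context.Context) (*runtimeEvent, error)
}

// processEvents is a simplified paraphrase of the loop in collector/main.go.
func processEvents(ctx context.Context, client extensionAPIClient) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
			// NextEvent blocks until the next invocation (or a SHUTDOWN event).
			// Calling it also signals to the Lambda service that this extension
			// is done with the previous invocation, so the sandbox may be frozen
			// even though the collector's exporter still has telemetry in flight.
			event, err := client.NextEvent(ctx)
			if err != nil {
				return err
			}
			if event.EventType == "SHUTDOWN" {
				return nil
			}
		}
	}
}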
This can cause issues when the destination for the collector is slow and the Lambda function itself finishes in a short period of time. In this situation, the collector could still be sending data while Lambda freezes the execution environment. I've drawn a timing diagram if it helps.
Let me correct one thing: that code does not actually do anything meaningful, and removing or keeping it has no impact :) I only put it there because it was copied from the Lambda extension sample.
The Lambda collector extension does not process any Lambda events itself. The receiver gets the incoming request directly from the OTel SDK in the runtime and then forwards it to the backend via the exporter in the same thread. That means the Lambda runtime should block there until the collector extension has forwarded the data to the backend, so Lambda should not freeze or shut down before that process completes. In the design doc we emphasize that this is achieved by NOT configuring a batching processor between the receiver and the exporter: https://github.com/open-telemetry/opentelemetry-lambda/blob/main/docs/design_proposal.md#21-lambda-freeze
So I would ask two questions:
- Does this issue only happen in the Node.js layer?
- Is any asynchronous logic involved in the recent collector OTLP exporter?
I'm not using the batching processor but rather the SimpleSpanProcessor. Even though I can call forceFlush(), as you can see it does not wait on what it has sent (it just sends it asynchronously), so perhaps I need a custom, synchronous span processor that waits on everything it hands to the receiver:
https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-sdk-trace-base/src/export/SimpleSpanProcessor.ts
Interestingly, the BatchSpanProcessor does seem to at least try to synchronise on forceFlush() / shutdown(): https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-sdk-trace-base/src/export/BatchSpanProcessorBase.ts
UPDATE: Using the BatchSpanProcessor in my Node.js code that uploads to the receiver, along with calling forceFlush() before the lambda exits, seems to be helping considerably. I'll explore more along these lines and update accordingly.
We're experiencing the same issue.
The receiver gets the incoming request directly from the OTel SDK in the runtime and then forwards it to the backend via the exporter in the same thread. That means the Lambda runtime should block there until the collector extension has forwarded the data to the backend, so Lambda should not freeze or shut down before that process completes.
That doesn't seem to be the case for us, even though we take care to call ForceFlush() from the lambda function itself (via the .NET SDK). If I enable debug level logging for the collector and run the function in the emulator, I can see that it logs received traces after the lambda runtime reports that the function has finished.
instance-1 | END RequestId: a261ba7c-ecda-488d-82f6-c4fdf0b3834c
instance-1 | REPORT RequestId: a261ba7c-ecda-488d-82f6-c4fdf0b3834c Init Duration: 0.17 ms Duration: 4662.45 ms Billed Duration: 4663 ms Memory Size: 3008 MB Max Memory Used: 3008 MB
instance-1 | 2022-07-07T01:23:09.703Z INFO loggingexporter/logging_exporter.go:42 TracesExporter {"#spans": 3}
instance-1 | 2022-07-07T01:23:09.704Z DEBUG loggingexporter/logging_exporter.go:51 ResourceSpans #0
instance-1 | Resource SchemaURL:
instance-1 | Resource labels:
instance-1 | -> telemetry.sdk.name: STRING(opentelemetry)
instance-1 | -> telemetry.sdk.language: STRING(dotnet)
instance-1 | -> telemetry.sdk.version: STRING(1.3.0.519)
instance-1 | -> cloud.provider: STRING(aws)
instance-1 | -> faas.name: STRING(test_function)
instance-1 | -> faas.version: STRING($LATEST)
instance-1 | -> faas.instance: STRING($LATEST)
instance-1 | -> service.name: STRING(MyService)
instance-1 | -> service.namespace: STRING(MyNamespace)
instance-1 | -> service.version: STRING(1.7.0.0)
instance-1 | -> service.instance.id: STRING(cf6f73c3-41e0-45ab-b6a2-441b6bcaae94)
instance-1 | ScopeSpans #0
instance-1 | ...
This only seems to affect traces, not metrics or logs. Is there some way to debug further to determine whether this is an issue with the collector itself vs some failure of the client library to block until the traces have been flushed?
Edit: if I add a console exporter on the .NET SDK side I see all of the traces logged to the console correctly, before the lambda runtime reports that the function has finished. So it seems likely that the non-blocking behaviour is on the collector side.
if I add a console exporter on the .NET SDK side I see all of the traces logged to the console correctly, before the lambda runtime reports that the function has finished. So it seems likely that the non-blocking behaviour is on the collector side.
Are you able to share the collector configuration you're using? There aren't any processors in the ADOT build that would introduce async behavior, and I don't believe any of the exporters will do anything async without prompting.
Thanks @Aneurysm9, our config looks like this:
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp:
    endpoint: https://otlp.nr-data.net:4317
    headers:
      api-key: ${NR_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      exporters: [otlp]
From what I can see, the collector main is simply calling NextEvent in a loop. NextEvent() simply calls the Next API on the Extensions API. According to the Lambda Extensions API, the Next call is what signals to the runtime that the extension has finished its work for the current invocation. This will cause the runtime to freeze until the next invocation or until shutdown.
So if the collector starts exporting some data during an invocation, it should NOT call Next until the export has finished (or failed). Otherwise the runtime freezes, and if it is not thawed again by another invocation, the export will fail and spans will be dropped.
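To make the idea concrete, a loop that drains the pipelines before requesting the next event could look roughly like the sketch below. This is only an illustration of the approach, not a proposed implementation; the extensionAPI/collectorHandle types and waitUntilFlushed are placeholders I made up, not real APIs from this repo.
package sketch

import (
	"context"
	"time"
)

// Placeholder types for illustration only.
type event struct{ EventType string }

type extensionAPI interface {
	NextEvent(ctx context.Context) (*event, error)
}

type collectorHandle interface {
	// PendingTelemetry reports whether received data is still being exported.
	PendingTelemetry() bool
	Shutdown(ctx context.Context) error
}

// waitUntilFlushed polls until the collector has nothing left in flight.
func waitUntilFlushed(ctx context.Context, col collectorHandle) error {
	ticker := time.NewTicker(20 * time.Millisecond)
	defer ticker.Stop()
	for col.PendingTelemetry() {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
	return nil
}

func runLoop(ctx context.Context, api extensionAPI, col collectorHandle) error {
	for {
		ev, err := api.NextEvent(ctx)
		if err != nil {
			return err
		}
		if ev.EventType == "SHUTDOWN" {
			// Flush and stop all pipelines before the environment goes away.
			return col.Shutdown(ctx)
		}
		// Drain the pipelines before calling NextEvent again, since that call
		// is what allows Lambda to freeze the sandbox. The hard part in
		// practice is knowing when the function's telemetry has actually
		// reached the receiver, which any real implementation has to solve.
		if err := waitUntilFlushed(ctx, col); err != nil {
			return err
		}
	}
}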
I have created #305 to attempt to resolve this issue. I think the general approach is sound and it works in the test scenario I have constructed, but if anyone else who has experienced this issue is able to test a layer created from that branch it would be greatly appreciated.
Awesome, thanks @Aneurysm9.
if anyone else who has experienced this issue is able to test a layer created from that branch it would be greatly appreciated
I've been testing that branch for a couple of days and it looks great so far.
edit: on further testing, I'm still seeing dropped spans similar to those being reported below.
I am testing the branch and things seem to work better, but I still see some strange results.
Basically the exporter seems to successfully send the traces to honeycomb.io in all the cases I can test (especially infrequent and fast lambda runs, which were the problem before). I can see all traces reliably appearing in the Honeycomb UI.
But there is still something wrong when the lambda runtime shuts down. I get something like this:
2022-11-04T09:22:04.881000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 {"level":"info","msg":"Waiting for quiescence","active_spans":1}
2022-11-04T09:22:04.881000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:22:04.881Z info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2}
2022-11-04T09:22:04.882000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:22:04.882Z info ResourceSpans #0
2022-11-04T09:22:04.882000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 Resource SchemaURL:
[... trace data omitted for brevity ...]
2022-11-04T09:22:04.882000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 -> znd.state_in.phase: Str()
2022-11-04T09:22:04.882000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 {"kind": "exporter", "data_type": "traces", "name": "logging"}
2022-11-04T09:22:04.883000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 {"level":"debug","msg":"Waiting for event..."}
2022-11-04T09:22:04.901000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:22:04.883Z e4da96a0-5d98-472d-82fa-e5344b58a0e7 DEBUG statusCode: 200
2022-11-04T09:22:04.922000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 END RequestId: e4da96a0-5d98-472d-82fa-e5344b58a0e7
2022-11-04T09:22:04.922000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 REPORT RequestId: e4da96a0-5d98-472d-82fa-e5344b58a0e7 Duration: 998.70 ms Billed Duration: 999 ms Memory Size: 150 MB Max Memory Used: 140 MB Init Duration: 1167.82 ms
[ ... here AWS shuts down the lambda runtime ... ]
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.493Z error exporterhelper/queued_retry.go:176 Exporting failed. No more retries left. Dropping data. {"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded", "dropped_items": 2}
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).onTemporaryFailure
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:176
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:411
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/[email protected]/exporter/exporterhelper/traces.go:134
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:206
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
2022-11-04T09:28:55.508000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:61
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.529Z info service/collector.go:196 Received shutdown request
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 {"level":"debug","msg":"Received ","event :":"{\n\t\"eventType\": \"SHUTDOWN\",\n\t\"deadlineMs\": 1667554137493,\n\t\"requestId\": \"\",\n\t\"invokedFunctionArn\": \"\",\n\t\"tracing\": {\n\t\t\"type\": \"\",\n\t\t\"value\": \"\"\n\t}\n}"}
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.529Z info service/service.go:116 Starting shutdown...
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.529Z info pipelines/pipelines.go:118 Stopping receivers...
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.529Z info pipelines/pipelines.go:125 Stopping processors...
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.529Z info pipelines/pipelines.go:132 Stopping exporters...
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.529Z info extensions/extensions.go:56 Stopping extensions...
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.529Z info lambdaextension/lambdaextension.go:58 Unregistered lambda span processor on tracer provider {"kind": "extension", "name": "lambda"}
2022-11-04T09:28:55.529000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 2022-11-04T09:28:55.529Z info service/service.go:130 Shutdown complete.
2022-11-04T09:28:55.530000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 {"level":"debug","msg":"Received SHUTDOWN event"}
2022-11-04T09:28:55.530000+00:00 2022/11/04/[$LATEST]e73f1dda75414fbda94ace08b54c5d86 {"level":"debug","msg":"Exiting"}
Maybe the exporter marks the spans as not active upon sending the request to honeycomb.io, but the goroutine yields while waiting for a response, and the "busy loop" in your new code in the extension checks again, finds no more active spans, and calls Next. This lets AWS freeze the runtime and the response never arrives, which causes the error when the environment gets thawed again just before the runtime is shut down?
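If that hypothesis is right, the race would look roughly like this (purely illustrative; none of these names come from the actual extension or exporter code):
package sketch

import (
	"net/http"
	"sync/atomic"
	"time"
)

// activeSpans stands in for whatever counter the extension polls.
var activeSpans atomic.Int64

func exportBatch(client *http.Client, req *http.Request, spanCount int64) {
	// Hypothesis: the counter is decremented when the request is handed off,
	// not when the backend has acknowledged it...
	activeSpans.Add(-spanCount)
	go func() {
		// ...so the goroutine waiting for the response is left behind once the
		// extension decides it is safe to call Next.
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
		}
	}()
}

func waitForQuiescenceThenNext(callNext func()) {
	// The "busy loop" only sees the counter: it reaches zero while the HTTP
	// response above is still outstanding, Next is called, Lambda freezes the
	// sandbox, and the response never arrives until the environment is thawed
	// again (or the context deadline expires at shutdown).
	for activeSpans.Load() > 0 {
		time.Sleep(10 * time.Millisecond)
	}
	callNext()
}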
Hello @Aneurysm9,
I have also tested the branch and it seems to be working, though there is still some odd behavior. I think the exporter does not send all of the trace information to the Central Collector: the Central Collector only receives the first span, where it should have received at least 4 spans.
Also, I just discovered something: does it usually take time for the exporter to send the traces to the Central Collector? I'm seeing that the authentication URL (where the token for authentication is fetched) is only called 6 minutes or more after the Lambda function is invoked (e.g. the Lambda invocation is around 11:30 AM and the URL is called around 11:38 AM), which is a weird result.
Here's a sample configuration for the exporter:
exporters:
  otlp:
    endpoint: CENTRAL_REMOTE_COLLECTOR_IP:4317
    tls:
      ca_file: /opt/certificate/ca.pem
      cert_file: /opt/certificate/certificate.pem
      key_file: /opt/certificate/private-key.pem
    auth:
      authenticator: oauth2client
I am also getting almost the same result as @nerochiaro showed:
2022-11-16T11:24:04.792+08:00 2022-11-16T03:24:04.792Z info service/collector.go:196 Received shutdown request
2022-11-16T11:24:04.792+08:00 2022-11-16T03:24:04.792Z info service/service.go:138 Starting shutdown...
2022-11-16T11:24:04.792+08:00 2022-11-16T03:24:04.792Z info pipelines/pipelines.go:118 Stopping receivers...
2022-11-16T11:24:04.795+08:00 2022-11-16T03:24:04.795Z info pipelines/pipelines.go:125 Stopping processors...
2022-11-16T11:24:04.795+08:00 2022-11-16T03:24:04.795Z info pipelines/pipelines.go:132 Stopping exporters...
2022-11-16T11:24:04.841+08:00 2022-11-16T03:24:04.841Z error exporterhelper/queued_retry.go:176 Exporting failed. No more retries left. Dropping data. {"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded", "dropped_items": 7}
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).onTemporaryFailure
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:176
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:411
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/[email protected]/exporter/exporterhelper/traces.go:134
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:206
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
2022-11-16T11:24:04.841+08:00 go.opentelemetry.io/[email protected]/exporter/exporterhelper/internal/bounded_memory_queue.go:61
2022-11-16T11:24:04.841+08:00 2022-11-16T03:24:04.841Z info extensions/extensions.go:56 Stopping extensions...
2022-11-16T11:24:04.841+08:00 2022-11-16T03:24:04.841Z info lambdaextension/lambdaextension.go:59 Unregistered lambda span processor on tracer provider {"kind": "extension", "name": "lambda"}
2022-11-16T11:24:04.841+08:00 2022-11-16T03:24:04.841Z info service/service.go:152 Shutdown complete.
[ UPDATE ] I ran some tests, but most of the time it still drops the traces and they never reach the Central Collector.
2022-11-16T06:57:27.864Z error exporterhelper/queued_retry.go:176 Exporting failed. No more retries left. Dropping data.
{
"kind": "exporter",
"data_type": "traces",
"name": "otlp",
"error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded",
"dropped_items": 8
}
Implementation Reference:
Thanks for the feedback @rmarasigan. I have put up a new draft PR that attempts to solve this in a slightly different way using the new Lambda Telemetry API. I think that will be more robust as it does not depend on the function emitting any telemetry to the collector.
As for the time gap you're seeing I believe that is due to the instance being frozen and then thawed later to shutdown. When it is later thawed the contexts on outgoing requests have expired and that error is emitted.
There is an additional potential source of asynchronous behavior that may need to be addressed. Most exporters will use the exporterhelper package, which includes a queued retry capability. Disabling that may be required to ensure that all telemetry has been shipped out of the collector before the function returns. This can be done for the OTLP exporter like this:
exporters:
  otlp:
    sending_queue:
      enabled: false
    ....
There is an additional potential source of asynchronous behavior that may need to be addressed. Most exporters will use the exporterhelper package, which includes a queued retry capability. Disabling that may be required to ensure that all telemetry has been shipped out of the collector before the function returns. This can be done for the OTLP exporter like this: exporters: otlp: sending_queue: enabled: false ....
Thanks for this, @Aneurysm9. It helps a lot after adding it to the configuration and running some tests 👍🏻 The "Exporting failed. No more retries left. Dropping data." error does not show up now, and I'm seeing that traces are being sent to the Central Collector. I've been using #305 when doing the tests and it seems to be working, but I will still keep an eye on it and on #310. Thanks for the great help! 💯
Thanks for the feedback @rmarasigan. I have put up a new draft PR that attempts to solve this in a slightly different way using the new Lambda Telemetry API. I think that will be more robust as it does not depend on the function emitting any telemetry to the collector.
As for the time gap you're seeing I believe that is due to the instance being frozen and then thawed later to shutdown. When it is later thawed the contexts on outgoing requests have expired and that error is emitted.
There is an additional potential source of asynchronous behavior that may need to be addressed. Most exporters will use the exporterhelper package, which includes a queued retry capability. Disabling that may be required to ensure that all telemetry has been shipped out of the collector before the function returns. This can be done for the OTLP exporter like this: exporters: otlp: sending_queue: enabled: false ....
This config tweak does appear to resolve the dropped spans issues I was seeing as well - thank you for the workaround @Aneurysm9! 👏 Will continue watching #305 & #310 for the "proper" long term fix
Hello @Aneurysm9, I've run some tests again using the #305 PR.
I tried making 1,200 requests to the API endpoint: I had 6 terminal tabs open, and in each tab I ran a script that calls the API endpoint 200 times. But only some of the traces came into the Remote Collector; around 500+ came in and the rest were dropped. I was using this configuration during the test:
exporters:
  otlp:
    sending_queue:
      enabled: false
    ....
I even tested it with this configuration:
exporters:
  otlp:
    sending_queue:
      enabled: true
      queue_size: 3000
    retry_on_failure:
      enabled: true
      initial_interval: 10s
But still, it is dropping the traces. Here are the errors generated:
2022-11-22T02:56:33.099Z error exporterhelper/queued_retry.go:176 Exporting failed. No more retries left. Dropping data.
{
"kind": "exporter",
"data_type": "traces",
"name": "otlp",
"error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded",
"dropped_items": 5
}
2022-11-22T02:49:31.577Z info exporterhelper/queued_retry.go:427 Exporting failed. Will retry the request after interval.
{
"kind": "exporter",
"data_type": "traces",
"name": "otlp",
"error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded",
"interval": "14.405090881s"
}
Is there any workaround for this kind of scenario, where sometimes the function/endpoint is called only occasionally and at other times it is called many times (e.g. 200 requests per second), but the complete traces still need to reach the Central Remote Collector?
[ UPDATE ] The Central Remote Collector is printing this error:
aws-otel-collector[4118]: 2022-11-22T10:00:12.517Z#011info#011zapgrpc/zapgrpc.go:174#011[transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"#011{"grpc_log": true}
@rmarasigan can you please try with #310? I think that #305 is unlikely to provide a complete solution and #310 is more likely to do so.
@rmarasigan can you please try with #310? I think that #305 is unlikely to provide a complete solution and #310 is more likely to do so.
@Aneurysm9 I've tried it with #310, but it still does not complete the trace. Using this config:
exporters:
  otlp:
    sending_queue:
      enabled: false
    retry_on_failure:
      enabled: true
      initial_interval: 10s
I made 1,200 requests to the API endpoint but only 406 traces were received. By the way, it is a call chain (Service A -> B -> C -> D). For the first API triggered, only 406 traces came into the collector. For the subsequent services, the collector also does not receive complete trace information, and some spans are being dropped.
The Central Remote Collector is printing this out:
Nov 23 04:14:32 ip-xxx-xxx-xxx-xxx aws-otel-collector[4118]: 2022-11-23T04:14:32.290Z#011info#011zapgrpc/zapgrpc.go:174#011[transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"#011{"grpc_log": true}
Nov 23 06:31:16 ip-xxx-xxx-xxx-xxx aws-otel-collector[11294]: 2022-11-23T06:31:16.612Z#011warn#011zapgrpc/zapgrpc.go:191#011[transport] transport: http2Server.HandleStreams failed to read frame: read tcp xxx-xxx-xxx-xxx:4317->xxx-xxx-xxx-xxx:45532: use of closed network connection#011{"grpc_log": true}
Nov 23 06:31:16 ip-xxx-xxx-xxx-xxx aws-otel-collector[11294]: go.uber.org/zap/zapgrpc.(*Logger).Errorln
Nov 23 06:31:16 ip-xxx-xxx-xxx-xxx aws-otel-collector[11294]: google.golang.org/grpc/internal/grpclog.ErrorDepth
Nov 23 06:31:16 ip-xxx-xxx-xxx-xxx aws-otel-collector[11294]: google.golang.org/grpc/grpclog.(*componentData).ErrorDepth
Nov 23 06:31:16 ip-xxx-xxx-xxx-xxx aws-otel-collector[11294]: google.golang.org/grpc/grpclog.(*componentData).Errorf
I've seen an error on the Lambda Layer side:
Error: exit status 2 Extension.Crash
Hi @Aneurysm9. Thanks for your work on this.
I've tried running #310 in the RIE but I'm seeing the following, followed by a whole bunch of other errors:
{"level":"fatal","ts":1669184806.0124168,"msg":"Cannot register TelemetryAPI client","error":"request to http://127.0.0.1:9001/2022-07-01/telemetry failed: 404[404 Not Found] 404 page not found\n"}
Is the new telemetry API supported in the emulator? (afaik, the log API that it replaces was not.) Any suggestion here would be appreciated as the emulator is quite an important part of our workflow.
@rmarasigan are you able to try the otlphttp exporter? I haven't seen that error before, but it looks like an issue with gRPC streams over connections that are broken, presumably when the function instance is frozen.
@ja-nick I do not believe the Telemetry API is supported in the emulator given that it says it "does not support AWS X-Ray tracing or other Lambda integrations." Would an option to suppress the collector startup under the emulator be workable for you?
@rmarasigan are you able to try the otlphttp exporter? I haven't seen that error before, but it looks like an issue with gRPC streams over connections that are broken, presumably when the function instance is frozen.
@Aneurysm9 yes, I've tried using the otlphttp exporter, but it gives me an error and does not send the traces to the Central Remote Collector. Also, the project requires gRPC instead of HTTP.
Here are the errors found for the otlphttp exporter:
{
"level": "error",
"ts": 1669259485.8455393,
"caller": "exporterhelper/queued_retry.go:394",
"msg": "Exporting failed. The error is not retryable. Dropping data.",
"kind": "exporter",
"data_type": "traces",
"name": "otlphttp",
"error": "Permanent error: error exporting items, request to http://xxx:xxx:xxx:xxx:4318/v1/traces responded with HTTP Status Code 400",
"dropped_items": 6,
"stacktrace": "go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\tgo.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:394\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\tgo.opentelemetry.io/[email protected]/exporter/exporterhelper/traces.go:133\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send\n\tgo.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:294\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesExporter.func2\n\tgo.opentelemetry.io/[email protected]/exporter/exporterhelper/traces.go:112\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\tgo.opentelemetry.io/[email protected]/consumer/traces.go:36\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/[email protected]/internal/trace/otlp.go:59\ngo.opentelemetry.io/collector/pdata/ptrace/ptraceotlp.rawTracesServer.Export\n\tgo.opentelemetry.io/collector/[email protected]/ptrace/ptraceotlp/grpc.go:72\ngo.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler.func1\n\tgo.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/trace/v1/trace_service.pb.go:310\ngo.opentelemetry.io/collector/config/configgrpc.enhanceWithClientInformation.func1\n\tgo.opentelemetry.io/[email protected]/config/configgrpc/configgrpc.go:421\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\tgoogle.golang.org/[email protected]/server.go:1162\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\tgo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:341\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\tgoogle.golang.org/[email protected]/server.go:1165\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\tgoogle.golang.org/[email protected]/server.go:1167\ngo.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler\n\tgo.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/trace/v1/trace_service.pb.go:312\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\tgoogle.golang.org/[email protected]/server.go:1340\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/[email protected]/server.go:1713\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\tgoogle.golang.org/[email protected]/server.go:965"
}
{
"level": "error",
"ts": 1669259485.8456972,
"caller": "exporterhelper/queued_retry.go:296",
"msg": "Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures.",
"kind": "exporter",
"data_type": "traces",
"name": "otlphttp",
"dropped_items": 6,
"stacktrace": "go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send\n\tgo.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:296\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesExporter.func2\n\tgo.opentelemetry.io/[email protected]/exporter/exporterhelper/traces.go:112\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\tgo.opentelemetry.io/[email protected]/consumer/traces.go:36\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/[email protected]/internal/trace/otlp.go:59\ngo.opentelemetry.io/collector/pdata/ptrace/ptraceotlp.rawTracesServer.Export\n\tgo.opentelemetry.io/collector/[email protected]/ptrace/ptraceotlp/grpc.go:72\ngo.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler.func1\n\tgo.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/trace/v1/trace_service.pb.go:310\ngo.opentelemetry.io/collector/config/configgrpc.enhanceWithClientInformation.func1\n\tgo.opentelemetry.io/[email protected]/config/configgrpc/configgrpc.go:421\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\tgoogle.golang.org/[email protected]/server.go:1162\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\tgo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:341\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\tgoogle.golang.org/[email protected]/server.go:1165\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\tgoogle.golang.org/[email protected]/server.go:1167\ngo.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler\n\tgo.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/trace/v1/trace_service.pb.go:312\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\tgoogle.golang.org/[email protected]/server.go:1340\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/[email protected]/server.go:1713\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\tgoogle.golang.org/[email protected]/server.go:965"
}
OTel Lambda Error: failed to force a flush, lambda may freeze before instrumentation exported: rpc error: code = Unknown desc = Permanent error: error exporting items, request to http://xxx:xxx:xxx:xxx:4318/v1/traces responded with HTTP Status Code 400
And this is how the config looks:
exporters:
  otlphttp:
    endpoint: https://xxx:xxx:xxx:xxx:4318
    retry_on_failure:
      enabled: true
      initial_interval: 10s
    tls:
      ca_file: /opt/certificate/ca.pem
      cert_file: /opt/certificate/certificate.pem
      key_file: /opt/certificate/private-key.pem
    auth:
      authenticator: oauth2client
@Aneurysm9
Would an option to suppress the collector startup under the emulator be workable for you?
That would definitely help. We do generally try to test our telemetry config in the emulator too though.
Best case would be that the Telemetry API integration would be turned off under emulation but that the collector itself continues to work.
Hi @ja-nick.
That would definitely help. We do generally try to test our telemetry config in the emulator too though. Best case would be that the Telemetry API integration would be turned off under emulation but that the collector itself continues to work.
Have you found a solution to this problem? Our workflow involves running the lambda container and collector locally (for testing and debugging purposes). We experience the same problem:
request to http://127.0.0.1:9001/2022-07-01/telemetry failed: 404[404 Not Found] 404 page not found