
Spans being dropped

Open • chris-armstrong opened this issue 2 years ago • 9 comments

Some of my spans are being dropped by the lambda nodejs layer. I'm experiencing errors like the following in my CloudWatch logs:

2022-03-02T01:07:36.673Z	error	exporterhelper/queued_retry_inmemory.go:93	Exporting failed. No more retries left. Dropping data.	
{
    "kind": "exporter",
    "name": "otlp",
    "error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded",
    "dropped_items": 3
}

I've tried tweaking my collector configuration to extend timeouts and the like, but I get the impression these errors are caused by the Lambda container going to sleep while spans are being sent, as these errors appear either at the start of the next request to the same container, or when it is shutting down.

My collector configuration is as follows:

receivers:
  otlp:
    protocols:
      http:

exporters:
  logging:
    loglevel: debug
  otlp:
    endpoint: api.honeycomb.io:443
    headers:
      "x-honeycomb-team": "${HONEYCOMB_API_KEY}"
      "x-honeycomb-dataset": "${NORMALISED_ENVIRONMENT_NAME}"
    retry_on_failure:
      max_elapsed_time: 900 
      max_interval: 30
      initial_interval: 2
    
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]

I'm using arn:aws:lambda:ap-southeast-2:901920570463:layer:aws-otel-nodejs-amd64-ver-1-0-1:1.

Is there something wrong with my configuration that stops it from sending the spans before it goes to sleep? I am not setting context.callbackWaitsForEmptyEventLoop to false (I presume it defaults to true).

chris-armstrong avatar Mar 02 '22 02:03 chris-armstrong

@chris-armstrong we ran into this too, and here is an explanation of why it doesn't work: https://github.com/open-telemetry/opentelemetry-lambda/issues/224

TL;DR: the collector extension doesn't wait for the data to be sent before making the next call to the Lambda Runtime API, so the execution environment is frozen before the send has completed.

nikovirtala avatar Mar 10 '22 10:03 nikovirtala

Looking at the code I thought this might be the case, but I refused to believe the developers would have neglected such an obvious problem: the Lambda goes to sleep after the extension returns, so it ought to wait for the spans to flush out before returning.

chris-armstrong avatar Mar 10 '22 10:03 chris-armstrong

Hi @chris-armstrong thanks for flagging this. We will take a look.

alolita avatar Mar 14 '22 17:03 alolita

Also see my comment here -

https://github.com/open-telemetry/opentelemetry-lambda/issues/224#issuecomment-1067490152

In short, I think using the SimpleSpanProcessor instead of the BatchSpanProcessor is the problem, and this makes sense (contrary to his advice) in light of what @wangzlei said about how the collector operates: the collector actually uploads requests from the receiver synchronously, so as long as the Lambda function code makes sure all spans have been synchronised before it returns, the collector should keep the Lambda running until it has exported all of them.

chris-armstrong avatar Mar 15 '22 04:03 chris-armstrong
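
To make that reasoning concrete, here is a minimal Node.js sketch of the "flush everything before the handler returns" approach. It is not the layer's built-in setup: the package names, the localhost endpoint, and the assumption that the application wires up the SDK itself are all illustrative.

// Flush all pending spans before the handler returns, so the export to the
// collector extension happens while the execution environment is still
// running rather than after it has been frozen.
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const provider = new NodeTracerProvider();
// The collector extension is assumed to listen on the standard OTLP/HTTP
// port inside the Lambda sandbox; adjust the URL to match your receiver.
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" })
  )
);
provider.register();

export const handler = async (_event: unknown) => {
  // ... application work that creates spans ...
  const result = { statusCode: 200, body: "ok" };
  // forceFlush() resolves once every registered span processor has handed
  // its buffered spans to the exporter and the export has completed.
  await provider.forceFlush();
  return result;
};

If the collector really does handle the receiver's request synchronously, as described in the comment above, the flush should not resolve until the collector's own upstream export has finished as well.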

Using trace.WithSyncer(exp) instead of trace.WithBatcher(exp) seems to solve the issue for a Lambda function.

tp := trace.NewTracerProvider(
    // exp is the span exporter; WithSyncer exports each span synchronously
    // as it ends, instead of batching exports in the background.
    trace.WithSyncer(exp),
    ...
)

szymon3 avatar Aug 23 '22 12:08 szymon3
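
In the Go SDK, trace.WithSyncer registers a processor that exports each span as soon as it ends. The closest Node.js counterpart is the SimpleSpanProcessor; a sketch of the swap, using the same assumed packages and endpoint as the earlier example:

import { SimpleSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Start the export for every span the moment it ends, rather than holding
// spans in a batch that may still be queued when the environment is frozen.
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" })
  )
);

Note, though, that the JS SimpleSpanProcessor does not block the caller until the export completes, so awaiting a flush before the handler returns (as in the earlier sketch) is still advisable.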

Hi, I am getting the same problem. Is there any fix or a way to mitigate the issue?

mcmikeyy avatar Sep 23 '22 07:09 mcmikeyy

I am also getting this same issue, using version 1.6.0 of the ADOT layer (JS). I noticed that there are two patterns of usage, one of which makes the issue noticeable, and one of which "masks" it, making it seem like everything is working fine.

  • (1) If I run the Lambda once and then wait, my traces don't go out. If I wait longer, I can see the extension getting terminated and the pending send failing with the error noted by @chris-armstrong in the OP.
  • (2) If the Lambda is run several times in quick succession, I can see traces going out. I suspect this is because the exporter starts to send the data, gets frozen, then thaws out at the next Lambda call and gets to complete the export.

I can see why this is not a popular issue, as most people's production usage falls into case (2). Nevertheless, it should be fixed.

To me it seems that the extension does not correctly make use of the Lambda Extensions API, which allows an extension to keep running after the function handler has returned. In this case, it should stay running for as long as it needs to send out the traces.

Would producing a minimal Lambda implementation help with reproducing this? Anything else I can do to help with this issue?

nerochiaro avatar Oct 10 '22 13:10 nerochiaro
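
For reference, the control flow being described looks roughly like the sketch below. This is not the collector's actual code; the endpoints come from the public Lambda Extensions API, flushPendingTelemetry is a hypothetical helper, and global fetch (Node 18+) is assumed. The point it illustrates is that any buffered telemetry has to be flushed before the extension asks for the next event, because the environment may be frozen as soon as every process is blocked on that call.

const base = `http://${process.env.AWS_LAMBDA_RUNTIME_API}/2020-01-01/extension`;

async function run(flushPendingTelemetry: () => Promise<void>): Promise<void> {
  // Register the extension for INVOKE and SHUTDOWN events.
  const reg = await fetch(`${base}/register`, {
    method: "POST",
    headers: { "Lambda-Extension-Name": "example-extension" },
    body: JSON.stringify({ events: ["INVOKE", "SHUTDOWN"] }),
  });
  const id = reg.headers.get("lambda-extension-identifier")!;

  while (true) {
    // flushPendingTelemetry is hypothetical: finish sending buffered spans
    // BEFORE asking for the next event, since calling /event/next tells the
    // platform this extension is idle and the environment can be frozen.
    await flushPendingTelemetry();
    const next = await fetch(`${base}/event/next`, {
      headers: { "Lambda-Extension-Identifier": id },
    });
    const event = await next.json();
    if (event.eventType === "SHUTDOWN") {
      await flushPendingTelemetry();
      return;
    }
  }
}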

This still happens with version 1.7.0 of the layer.

I posted here a quick analysis of what seems to cause the problem in the interaction between the collector and the lambda extension API.

nerochiaro avatar Oct 10 '22 15:10 nerochiaro

Chiming in as yet another user who's experienced this behavior, with the Python runtime, including the nuance that @nerochiaro described above: one-off executions of a Lambda do not send traces, but quick successive executions do.

morrissimo avatar Oct 18 '22 20:10 morrissimo

I have been working through some potential solutions to this issue and have created https://github.com/open-telemetry/opentelemetry-lambda/pull/305 to get feedback on one approach I think will work. I would appreciate any feedback if you are able to review and/or test this approach.

Aneurysm9 avatar Nov 02 '22 17:11 Aneurysm9

Hate to be that guy, but: any updates on this issue? This behavior is a major blocker for our adoption of OpenTelemetry :/ Even some sort of workaround (like the ability to "force flush" the collector's buffered data, etc.) while a More Correct solution is worked on would be great.

morrissimo avatar Nov 16 '22 15:11 morrissimo

FYI, this workaround seems to be working, so there is at least some way to move on until a proper fix can be implemented for the base opentelemetry-lambda components.

morrissimo avatar Nov 17 '22 19:11 morrissimo

https://github.com/open-telemetry/opentelemetry-lambda/pull/310/ has been merged and will be included in the next release of the ADOT Lambda Layer, which is presently being prepared. That change ensures that collector exporters do not use the queued retry functionality of the exporter helper, and it also uses the Lambda Telemetry API to avoid indicating that the collector is ready for the next Lambda event until after the function invocation is complete.

Aneurysm9 avatar Dec 09 '22 23:12 Aneurysm9