opentelemetry-lambda

Deadline exceeded in DataDog exporter

Open 3miliano opened this issue 1 year ago

Describe the bug I am experiencing a "context deadline exceeded" issue with the DataDog exporter, as evidenced by the logs below. The issue results in failed export attempts and subsequent retries.

Steps to reproduce

  1. Configure the custom Docker image with the custom collector based on opentelemetry-lambda that includes the DataDog exporter.
  2. Initiate data export (traces, logs, metrics).
  3. Observe the logs for errors related to context deadlines being exceeded.

What did you expect to see? I expected the data to be exported successfully to DataDog without any timeout errors.

What did you see instead? The export requests failed with “context deadline exceeded” errors, resulting in retries and eventual dropping of the payloads. Here are some excerpts from the logs:

1719687286935 {"level":"warn","ts":1719687286.9350078,"caller":"[email protected]/batch_processor.go:263","msg":"Sender failed","kind":"processor","name":"batch","pipeline":"logs","error":"no more retries left: Post \"https://http-intake.logs.datadoghq.com/api/v2/logs?ddtags=service%3Akognitos.book.yaml%2Cenv%3Amain%2Cregion%3Aus-west-2%2Ccloud_provider%3Aaws%2Cos.type%3Alinux%2Cotel_source%3Adatadog_exporter\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
1719687286936 {"level":"error","ts":1719687286.9363096,"caller":"[email protected]/traces_exporter.go:181","msg":"Error posting hostname/tags series","kind":"exporter","data_type":"traces","name":"datadog","error":"max elapsed time expired Post \"https://api.datadoghq.com/api/v2/series\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","stacktrace":"github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*traceExporter).exportUsageMetrics\n\t/root/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/traces_exporter.go:181\ngithub.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*traceExporter).consumeTraces\n\t/root/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/traces_exporter.go:139\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/traces.go:59\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/timeout_sender.go:43\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/traces.go:159\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:294\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesRequestExporter.func1\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/traces.go:134\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/traces.go:25\ngo.opentelemetry.io/collector/processor/batchprocessor.(*batchTraces).export\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:414\ngo.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:261\ngo.opentelemetry.io/collector/processor/batchprocessor.(*shard).startLoop\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:223"}

What version of collector/language SDK version did you use? Version: Custom layer-collector/0.8.0 + datadogexporter from v0.103.0

What language layer did you use? Config: None. It is a custom runtime that includes the binary in extensions.

Additional context Here is my configuration file:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "127.0.0.1:4317"
  hostmetrics:
    collection_interval: 60s
    scrapers:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      disk:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      load:
      memory:
      network:
      processes:

exporters:
  datadog:
    api:
      key: ${secretsmanager:infrastructure/datadog_api_key}
    sending_queue:
      enabled: false
    tls:
      insecure: true
      insecure_skip_verify: true

connectors:
  datadog/connector:
      
processors:
  resourcedetection:
    detectors: ["lambda", "system"]
    system:
      hostname_sources: ["os"]
  transform:
    log_statements:
      - context: resource
        statements:
          - delete_key(attributes, "service.version")
          - set(attributes["service"], attributes["service.name"])
          - delete_key(attributes, "service.name")
      - context: log
        statements:
          - set(body, attributes["exception.message"]) where attributes["exception.message"] != nil
          - set(attributes["error.stack"], attributes["exception.stacktrace"]) where attributes["exception.stacktrace"] != nil
          - set(attributes["error.message"], attributes["exception.message"]) where attributes["exception.message"] != nil
          - set(attributes["error.kind"], attributes["exception.kind"]) where attributes["exception.kind"] != nil
service:
  telemetry:
    logs:
      level: "debug"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection]
      exporters: [datadog/connector]
    traces/2:
      receivers: [datadog/connector]
      exporters: [datadog]
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [resourcedetection]
      exporters: [datadog]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, transform]
      exporters: [datadog]

Enabling/disabling sending_queue does not seem to do anything to prevent the errors. I did notice that if I hit the service continuously, some traces do get sent, but only a few.

What I discarded as potential solutions:

  1. Connectivity issues. DataDog's API key validation call succeeds, and if the service is hit constantly some traces get through.

3miliano avatar Jun 29 '24 20:06 3miliano

Any reason you're not using the batch processor? That would probably help.

tylerbenson avatar Jul 11 '24 22:07 tylerbenson

@3miliano I think it is because the container is frozen right after the invocation completes, and with the configs you have shared the collector is not aware of the Lambda lifecycle. So, as @tylerbenson suggested, using the batch processor (which will activate the decouple processor by default) right before the Datadog exporter should resolve your problem.
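For reference, a minimal sketch of that suggestion applied to the traces/2 pipeline from the config above (the 5s batch timeout is only an illustration, and the explicit decouple entry merely mirrors what the Lambda collector adds automatically once batch is present; the metrics and logs pipelines would be adjusted the same way):

processors:
  batch:
    timeout: 5s
  decouple:

service:
  pipelines:
    traces/2:
      receivers: [datadog/connector]
      processors: [batch, decouple]
      exporters: [datadog]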

serkan-ozal avatar Aug 28 '24 22:08 serkan-ozal

@serkan-ozal we are in a similar situation where we see client timeout errors very often, especially with cold starts. If I run a perf test with many threads pounding my Lambda, I do not observe this behavior.

We have tested against v0.91 and also v0.119.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4317"
      http:
        endpoint: "localhost:4318"
 
 
exporters:
  debug:
    verbosity: detailed
  otlphttp:
    endpoint: https://enterprise.x.com:1234
    retry_on_failure:
      initial_interval: ${env:ENT_OTEL_INITIAL_INTERVAL}
      max_interval: ${env:ENT_OTEL_MAX_INTERVAL}
      max_elapsed_time: ${env:ENT_OTEL_MAX_ELAPSED_TIME}
    timeout: ${env:ENT_OTEL_TRANSPORT_TIMEOUT}
    sending_queue:
      queue_size: 5000
 
processors:
  decouple:
    max_queue_size: 5000
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, otlphttp]
      processors: [decouple]
    metrics:
      receivers: [otlp]
      exporters: [debug, otlphttp]
      processors: [decouple]

We are on AWS; the gateway is behind an ALB. We tried running a telnet client from the Lambda to our ALB, and that came back very quickly (< 40 ms), but the otlphttp exporter fails even with a timeout of 100 ms or higher. At 2000 ms, it always works.

Seeking suggestions.

bhaskarbanerjee avatar Mar 21 '25 19:03 bhaskarbanerjee

Hi @bhaskarbanerjee,

I am a little bit lost here. Are you saying that the otlphttp exporter works without any issue when its timeout is set to 2000 ms, but fails with a timeout when the limit is set to 100 ms?

serkan-ozal avatar Mar 22 '25 21:03 serkan-ozal

Yes @serkan-ozal that is correct. At 75 ms, 100 ms, 150 ms ... up to 1500 ms I see ClientTimeout far too often. At 2000 ms I don't see it at all. My gateway (ALB) is showing a p99 response time of 70 ms. I was assuming that if the ALB response time is 70 ms, then adding a few milliseconds more for VPC communication, 75 ms or let's say 100 ms, should be good enough for the collector.

Telnet to the same gateway comes back within 40 ms, but I also understand that there is a difference between a plain telnet connection and the collector's export.

bhaskarbanerjee avatar Mar 22 '25 21:03 bhaskarbanerjee

@serkan-ozal any thoughts?

bhaskarbanerjee avatar Mar 25 '25 16:03 bhaskarbanerjee

@bhaskarbanerjee Is that happening when many Lambda instances are spinning up and running in parallel? I was just curious whether there are too many connection requests and some of them are blocked.

serkan-ozal avatar Mar 26 '25 07:03 serkan-ozal

Hi @bhaskarbanerjee, any info/update on this?

serkan-ozal avatar Apr 04 '25 16:04 serkan-ozal

Still testing different scenarios, @serkan-ozal. I still have not been able to determine whether too many instances are causing this, but I found a few other hints:

  1. The ALB shows a Client TLS Negotiation Error whenever the export fails with client.timeout errors. I tried the TLS setting in the exporter with insecure_skip_verify: true (see the sketch below), but that didn't help.
  2. Additionally, I tried exporting to a bad DNS name (which should have returned a 404), but again got back a client.timeout error.
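For reference, the TLS override mentioned in point 1 would look roughly like this on the otlphttp exporter (endpoint reused from the config earlier; note that insecure_skip_verify only skips certificate verification, the TLS handshake round trips still happen):

exporters:
  otlphttp:
    endpoint: https://enterprise.x.com:1234
    tls:
      insecure_skip_verify: true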

I am planning to build it locally by pointing to a modified version of the exporter and exporterhelper (with added logging) to find out what is happening there. The plan is for the latter half of this week.

Will update you by next Friday.

Many of our customers are starting to complain about this; it wasn't as bad when we rolled it out late last year. Let's see.

bhaskarbanerjee avatar Apr 04 '25 21:04 bhaskarbanerjee

@serkan-ozal we enabled DEBUG level logs from the OTel collector extension and now we see this: "http2: Transport failed to get client conn for otelservices.clouddqt.capitalone.com:9990: http2: no cached connection was available". This is using http/protobuf, not gRPC.

Are you aware of any issues with this? I found some analysis here but no solution.

bhaskarbanerjee avatar Apr 16 '25 20:04 bhaskarbanerjee

@serkan-ozal we enabled debug logs on the net/http2 library and observed that the TLS connection is the culprit during session creation. Does the open source community have any stats on TLS 1.2 vs TLS 1.3 in terms of performance? We are currently on TLS 1.2 and are trying to upgrade to 1.3 and test.
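For reference, this kind of net/http2 debug logging typically comes from Go's built-in GODEBUG switch rather than from a collector option; an illustrative (assumed, SAM/CloudFormation-style) way to set it on the Lambda function would be:

Environment:
  Variables:
    GODEBUG: "http2debug=2"   # verbose HTTP/2 transport logging from the Go runtime, including frame dumps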

bhaskarbanerjee avatar Apr 30 '25 13:04 bhaskarbanerjee

@serkan-ozal We were able to resolve the issue. Let me describe our story here. We have deployed the gateway behind an ALB that is encrypted with TLS1.2

As mentioned in my previous post, we were observing lots of client-side TLS Negotiation errors whenever a data packet was being dropped (client timeout); upon enabling http2debug in the OTel collector, we found the following:

  1. The client ALWAYS first looks in the cache for an existing connection.
  2. If one is not available, it tries to establish a TLS connection.
  3. It then uses that transport to begin the data export.

In our case, step 2 was failing because it takes a few handshakes to complete the TLS setup with the gateway.

We tried to leverage TLS 1.3 (because it is faster than TLS 1.2) but couldn't due to enterprise restrictions on our end; we also tried an NLB, which didn't yield great results either.

Our solution: we saw that establishing the TLS connection was deferred until the actual export, and when that setup failed, it failed the telemetry export along with it.

We have extended a fork of the open source collector in which we greedily initialize the TLS connection (non-blocking) so that we can give it sufficient time (during the cold start) to complete, rather than curbing it to a short transport timeout later in the run. We then cache the connection.

Would the open source project benefit from this contribution, wherein a user could opt into greedy initialization when faced with such challenges? PS: There can still be client timeouts resulting in data drops if the cached connection is invalidated after a while (all of which can be further controlled by a keepalive parameter, subject to other nuances).

Please advise.

bhaskarbanerjee avatar May 20 '25 14:05 bhaskarbanerjee

Hi @bhaskarbanerjee,

Thank you for analyzing the issue in detail and finding the solution.

So, as far as I understand, the underlying problem is that when too many concurrent connections are opened to the gateway from the local OTel collectors in the Lambda environment, the gateway is not able to respond to the connection and handshake requests quickly enough, and you get connection errors when the timeout limit is not high enough.

Is that correct?

Also, do you know whether or not there is always a single connection (to the gateway) from each local OTel collector instance in each Lambda instance?

serkan-ozal avatar May 21 '25 08:05 serkan-ozal

@bhaskarbanerjee

Maybe you can also check some additional http client configs for the exporter here: https://github.com/open-telemetry/opentelemetry-collector/tree/main/config/confighttp#client-configuration

For example, you may also check whether limiting max_conns_per_host helps or not.
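For reference, a minimal sketch of how that client option would be set on the otlphttp exporter above (the value of 1 is only an illustration):

exporters:
  otlphttp:
    endpoint: https://enterprise.x.com:1234
    max_conns_per_host: 1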

serkan-ozal avatar May 21 '25 08:05 serkan-ozal

@serkan-ozal Thanks for that advice. Running the Lambda with the collector extension from the console, even once every 5 minutes or so, also leads to a client timeout. There are no running connections from the Lambda, so I am not sure it has to do with many concurrent connections.

But let me try out the max_conns_per_host and see if that helps. Will update you.

bhaskarbanerjee avatar May 21 '25 14:05 bhaskarbanerjee

@serkan-ozal setting max_conns_per_host did not have any impact on the problem. In addition, we tested different keepalive values and saw very minimal data export errors (< 0.1%) when the keepalive is between 90 and 300 s.

On our end, this tweak is working out well. So we would like to know whether there are any perf test results from the open source project in terms of export time vs. data drops, or whether the project would benefit from this contribution and whether there is a SIG to take this to.
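For reference, the standard confighttp knobs that come closest to this keepalive tweak (assuming the otlphttp exporter; the specific values are only illustrations loosely based on the 90-300 s range mentioned above, and the forked greedy TLS initialization itself has no upstream equivalent):

exporters:
  otlphttp:
    endpoint: https://enterprise.x.com:1234
    idle_conn_timeout: 300s        # how long an idle cached connection is kept before being closed
    http2_read_idle_timeout: 90s   # send an HTTP/2 health-check ping after this much read inactivity
    http2_ping_timeout: 15s        # close the connection if the ping is not answered within this time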

bhaskarbanerjee avatar May 27 '25 17:05 bhaskarbanerjee