
HTTP Timeouts don't seem to retry, resulting in dropped spans

Open jhoward-ts opened this issue 1 year ago • 2 comments

What happened?

Steps to Reproduce

https://github.com/jhoward-ts/otel-node-sdk-timeout-example

This is a minimal example; the timeout values could be increased, etc., but since this server never succeeds, that doesn't matter. It just demonstrates that a single timeout failure loses spans irreparably.

This occurs (with a different error) regardless of whether it's the span processor or trace exporter that times out the request.
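Roughly, the setup in that repo boils down to the following sketch (the endpoint, timeout, and delay values here are illustrative, not copied from the repository): an OTLP/HTTP trace exporter with a short timeout behind a BatchSpanProcessor, pointed at a server that accepts the request but never responds successfully.

// Minimal repro sketch - illustrative values, not the exact code from the linked repo.
import { NodeTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const exporter = new OTLPTraceExporter({
  url: 'http://localhost:4318/v1/traces', // assumed endpoint; the repro server never answers successfully
  timeoutMillis: 1000,                    // short timeout so the export fails quickly
});

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(exporter, { scheduledDelayMillis: 500 }));
provider.register();

// One span is enough: when the export times out, the batch is dropped rather than retried.
provider.getTracer('repro').startSpan('example-span').end();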

Expected Result

  1. The node http OTLPExporterNodeBase should retry timed-out connections (https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#all-other-responses). This one I'm pretty sure is a bug, in the docs if not the exporter; to me that section clearly implies that network failures will be retried.

  2. The BatchSpanProcessor should hold failed spans and include them in the next export, up to maxQueueSize, instead of dropping them after a single failure. Admittedly this is my own expectation, and I could not find documentation one way or the other; maybe I'm using the wrong processor, have configured it wrong, or this behavior is somehow counter to OTel expectations. (A rough sketch of the behavior I'd expect follows this list.)
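
For concreteness, here is a sketch of the retry behavior described in (1), written as a wrapper around a SpanExporter. RetryingExporter, the fixed-delay policy, and the attempt counts are my own illustration and not anything the SDK provides; ExportResult and ExportResultCode come from @opentelemetry/core, which the SDK packages pull in transitively.

import { ExportResult, ExportResultCode } from '@opentelemetry/core';
import { ReadableSpan, SpanExporter } from '@opentelemetry/sdk-trace-node';

// Hypothetical wrapper (not part of the SDK): retry a failed export a few times
// with a fixed delay before giving up and reporting the failure.
class RetryingExporter implements SpanExporter {
  constructor(
    private inner: SpanExporter,
    private maxAttempts = 3,
    private delayMs = 1000,
  ) {}

  export(spans: ReadableSpan[], resultCallback: (result: ExportResult) => void): void {
    const attempt = (remaining: number) => {
      this.inner.export(spans, (result) => {
        if (result.code === ExportResultCode.SUCCESS || remaining <= 1) {
          resultCallback(result); // success, or out of attempts
        } else {
          setTimeout(() => attempt(remaining - 1), this.delayMs); // failed: try again later
        }
      });
    };
    attempt(this.maxAttempts);
  }

  shutdown(): Promise<void> {
    return this.inner.shutdown();
  }
}

// Usage (hypothetical): new BatchSpanProcessor(new RetryingExporter(new OTLPTraceExporter()))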

Actual Result

A single http timeout results in immediate span loss.

Additional Details

Feel free to turn this into a Discussion or similar if all of this behavior is expected. I just want to know how I can make my http span exporting more robust against common, transient network issues.

Any clarification here is much appreciated!

OpenTelemetry Setup Code

https://github.com/jhoward-ts/otel-node-sdk-timeout-example

package.json

{
  "name": "connector-events-load-test",
  "module": "index.ts",
  "type": "module",
  "devDependencies": {
    "@types/bun": "latest"
  },
  "peerDependencies": {
    "typescript": "^5.0.0"
  },
  "dependencies": {
    "@aws-sdk/client-cloudwatch-logs": "^3.523.0",
    "@opentelemetry/exporter-trace-otlp-http": "^0.49.1",
    "@opentelemetry/sdk-trace-node": "^1.22.0",
    "@types/uuid": "^9.0.8",
    "axios": "^1.6.7",
    "uuid": "^9.0.1"
  }
}

Relevant log output

2024-03-01T13:27:43.848Z: Sending trace
{"level":"error","message":"{\"message\":\"Request Timeout\",\"name\":\"OTLPExporterError\",\"originalLine\":\"7\",\"originalColumn\":\"12\",\"line\":\"7\",\"column\":\"12\",\"sourceURL\":\"/home/jhoward/code/opentelemetry-node-sdk-no-timeout-retries/node_modules/@opentelemetry/otlp-exporter-base/build/src/types.js\",\"stack\":\"OTLPExporterError: Request Timeout\\n    at new OTLPExporterError (/home/jhoward/code/opentelemetry-node-sdk-no-timeout-retries/node_modules/@opentelemetry/otlp-exporter-base/build/src/types.js:24:13)\\n    at <anonymous> (/home/jhoward/code/opentelemetry-node-sdk-no-timeout-retries/node_modules/@opentelemetry/otlp-exporter-base/build/src/platform/node/util.js:9:50)\"}","extra":{"context":"OtelSdk"}}
^C

Received request, aborting...
Span count: 1
Received request, aborting...
Span count: 1
Received request, aborting...
Span count: 1

jhoward-ts avatar Mar 01 '24 13:03 jhoward-ts

Yes, timeouts should be retried. Assigning P2 as we lose telemetry due to this.

pichlermarc avatar Mar 06 '24 12:03 pichlermarc

Hi @jhoward-ts, thanks for reaching out. Sorry for taking so long to respond to this.

Looks like there are three things that we need to consider here:

Non-http connection errors:

  • Example: The app has started, but the collector has not started yet - we'll get an ECONNREFUSED.
  • We should retry these - there could be some transient problem causing this. I'd consider it a bug that we don't right now, given that the specification says we should retry on transient errors and there is potential for such an error to be transient. (See the sketch after this list for how such a failure surfaces.)
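
For illustration, this is how such a failure surfaces from the Node.js http module; classifying these codes as transient/retryable is an assumption for the sketch, not the exporter's current logic:

import * as http from 'http';

// Illustrative only: when no collector is listening on the port, the request's 'error'
// event fires with code 'ECONNREFUSED'. Treating that (and similar codes) as retryable
// is the assumption here, not what the exporter currently does.
const req = http.request({ host: 'localhost', port: 4318, path: '/v1/traces', method: 'POST' });
req.on('error', (err) => {
  const code = (err as NodeJS.ErrnoException).code;
  const transient = code === 'ECONNREFUSED' || code === 'ECONNRESET' || code === 'ETIMEDOUT';
  console.log(`export failed with ${code}, retryable: ${transient}`);
});
req.end();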

Timeouts:

  • Example: as given in the issue description
  • I'm conflicted about these - the timeout we pass in right now applies to the entire export, including any retries. So if a request times out, there's no time left to retry the export anyway. We'd have to add a timeoutUntilRetry option to allow end-users to set that specifically. (A rough sketch of that trade-off follows this list.)
    • OTel Python handles it the same way as we do: https://github.com/open-telemetry/opentelemetry-python/blob/5ddb8e7407a06b560c5483f7649913e170229936/exporter/opentelemetry-exporter-otlp-proto-http/src/opentelemetry/exporter/otlp/proto/http/trace_exporter/init.py#L174-L208
    • OTel Java seems to take a similar approach - https://github.com/open-telemetry/opentelemetry-java/blob/2c0e6c49860035281817b4ee5f5978a9bfbaf6d5/exporters/sender/okhttp/src/main/java/io/opentelemetry/exporter/sender/okhttp/internal/OkHttpHttpSender.java#L85-L89
      • they do have a separate connect timeout, but that functionality does not exist in the Node.js http module AFAIK - its timeout makes no distinction between the connect and read/write phases.
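
To make the trade-off concrete, here's a hypothetical sketch of retrying under a single overall deadline; sendOnce and the parameter names are made up for illustration, and this is not the exporter's implementation:

// Hypothetical sketch: there is one overall deadline (totalTimeoutMs, i.e. timeoutMillis today),
// so each attempt - including one bounded by a per-attempt timeout like the timeoutUntilRetry
// idea above - has to fit into whatever budget remains. sendOnce stands in for one HTTP export attempt.
async function exportWithDeadline(
  sendOnce: (attemptTimeoutMs: number) => Promise<void>,
  totalTimeoutMs: number,
  perAttemptTimeoutMs: number,
): Promise<void> {
  const deadline = Date.now() + totalTimeoutMs;
  for (;;) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) throw new Error('export deadline exceeded');
    try {
      // Each attempt may use at most the smaller of the per-attempt timeout and the remaining budget.
      await sendOnce(Math.min(perAttemptTimeoutMs, remaining));
      return;
    } catch {
      // Attempt failed (e.g. timed out): back off briefly, then retry if budget remains.
      await new Promise((resolve) => setTimeout(resolve, 100));
    }
  }
}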

Handling of failed batches

  • Example: if an export fails for any of the above reasons, the BatchSpanProcessor will drop the spans
  • I think this is intended behavior. I checked the OTel specification and it does not specify what to do in that case; other SDK implementations also seem to drop spans on failed exports. (The options that are configurable are sketched below.)
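
For reference, the options that BatchSpanProcessor does expose only control queueing and batching; none of them re-enqueue a batch after a failed export. The values below are the defaults, spelled out for illustration:

import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Illustrative configuration: these options bound how spans are queued and batched,
// but a batch whose export fails is still dropped rather than put back on the queue.
const processor = new BatchSpanProcessor(new OTLPTraceExporter(), {
  maxQueueSize: 2048,         // spans buffered before new ones are dropped
  maxExportBatchSize: 512,    // spans sent per export call
  scheduledDelayMillis: 5000, // how often a batch is flushed
  exportTimeoutMillis: 30000, // how long a single export may take
});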

A side note on timeout handling in Bun: I see that the example here targets Bun. There was a bug we hit where the http implementation behaved differently depending on whether the user was running Node.js or Bun; the timeout would always trigger immediately (#5260). We have since resolved this by dropping support for old Node.js runtimes, so if you're seeing similar behavior using Bun, it should be fixed in the latest version of the exporter.

pichlermarc avatar Nov 17 '25 15:11 pichlermarc