HTTP timeouts don't seem to be retried, resulting in dropped spans
What happened?
Steps to Reproduce
https://github.com/jhoward-ts/otel-node-sdk-timeout-example
This is a minimal example - the timeout values could be increased, etc., but since this server never succeeds, the specific values are irrelevant. It just demonstrates that a single timeout failure loses spans irreparably.
This occurs (with a different error) regardless of whether it's the span processor or trace exporter that times out the request.
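For reference, the server side of the repro is essentially a collector stand-in that never completes a response. A minimal sketch of that idea (the real code is in the linked repo; the port and details here are assumptions, though the "Received request, aborting..." log line below matches this behavior):

// Stand-in for a misbehaving OTLP collector: accept the export request,
// then destroy the socket so the exporter's HTTP request fails/times out
// and the batch is lost.
import { createServer } from "node:http";

createServer((req) => {
  console.log("Received request, aborting...");
  req.socket.destroy(); // never send a response
}).listen(4318); // default OTLP/HTTP port (assumed)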
Expected Result
- The node http OTLPExporterNodeBase should retry timed-out connections. https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#all-other-responses I'm pretty sure this one is a bug - in the docs if not in the Exporter. To me the spec clearly implies that network failures will be retried.
- The BatchSpanProcessor should hold failed spans and include them in the next export (up to maxQueueSize) instead of dropping them after a single failure. Admittedly this is my own expectation, and I could not find documentation one way or the other. Maybe I'm using the wrong processor, have configured it wrong, or this behavior is somehow counter to OTEL expectations; a sketch of what I mean follows below.
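To make the second expectation concrete, here is a rough sketch of the behavior I had in mind - a hypothetical wrapper (RequeueingExporter is my own name, not an SDK API) that re-buffers a failed batch and prepends it to the next export, up to a cap:

import type { SpanExporter, ReadableSpan } from "@opentelemetry/sdk-trace-base";
import { ExportResultCode, type ExportResult } from "@opentelemetry/core";

// Hypothetical wrapper (not an SDK API): keep spans from failed exports and
// prepend them to the next batch, dropping the oldest once maxBuffered is hit.
class RequeueingExporter implements SpanExporter {
  private buffered: ReadableSpan[] = [];

  constructor(private inner: SpanExporter, private maxBuffered = 2048) {}

  export(spans: ReadableSpan[], cb: (result: ExportResult) => void): void {
    const batch = [...this.buffered, ...spans];
    this.buffered = [];
    this.inner.export(batch, (result) => {
      if (result.code === ExportResultCode.FAILED) {
        // Hold the failed batch for the next export instead of dropping it.
        this.buffered = batch.slice(-this.maxBuffered);
      }
      cb(result);
    });
  }

  shutdown(): Promise<void> {
    return this.inner.shutdown();
  }
}

It would slot in as new BatchSpanProcessor(new RequeueingExporter(exporter)).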
Actual Result
A single http timeout results in immediate span loss.
Additional Details
Feel free to turn this into a Discussion or similar if all of this behavior is expected. I just want to know how I can make my http span exporting more robust against common, transient network issues.
Any clarification here is much appreciated!
OpenTelemetry Setup Code
https://github.com/jhoward-ts/otel-node-sdk-timeout-example
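The full setup is in the linked repo; it boils down to something like this (a sketch - the URL and timeout value are placeholders, not the exact repro values):

import { NodeTracerProvider, BatchSpanProcessor } from "@opentelemetry/sdk-trace-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Exporter pointed at the never-succeeding server above; a short timeout
// makes the failure quick to reproduce.
const exporter = new OTLPTraceExporter({
  url: "http://localhost:4318/v1/traces",
  timeoutMillis: 1000,
});

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();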
package.json
{
"name": "connector-events-load-test",
"module": "index.ts",
"type": "module",
"devDependencies": {
"@types/bun": "latest"
},
"peerDependencies": {
"typescript": "^5.0.0"
},
"dependencies": {
"@aws-sdk/client-cloudwatch-logs": "^3.523.0",
"@opentelemetry/exporter-trace-otlp-http": "^0.49.1",
"@opentelemetry/sdk-trace-node": "^1.22.0",
"@types/uuid": "^9.0.8",
"axios": "^1.6.7",
"uuid": "^9.0.1"
}
}
Relevant log output
2024-03-01T13:27:43.848Z: Sending trace
{"level":"error","message":"{\"message\":\"Request Timeout\",\"name\":\"OTLPExporterError\",\"originalLine\":\"7\",\"originalColumn\":\"12\",\"line\":\"7\",\"column\":\"12\",\"sourceURL\":\"/home/jhoward/code/opentelemetry-node-sdk-no-timeout-retries/node_modules/@opentelemetry/otlp-exporter-base/build/src/types.js\",\"stack\":\"OTLPExporterError: Request Timeout\\n at new OTLPExporterError (/home/jhoward/code/opentelemetry-node-sdk-no-timeout-retries/node_modules/@opentelemetry/otlp-exporter-base/build/src/types.js:24:13)\\n at <anonymous> (/home/jhoward/code/opentelemetry-node-sdk-no-timeout-retries/node_modules/@opentelemetry/otlp-exporter-base/build/src/platform/node/util.js:9:50)\"}","extra":{"context":"OtelSdk"}}
^C
Received request, aborting...
Span count: 1
Received request, aborting...
Span count: 1
Received request, aborting...
Span count: 1
Yes, timeouts should be retried. Assigning P2 as we lose telemetry due to this.
Hi @jhoward-ts, thanks for reaching out. Sorry for taking so long to respond to this.
Looks like there are three things that we need to consider here:
Non-http connection errors:
- Example: the app has started, but the collector has not started yet - we'll get an ECONNREFUSED.
- We should retry these - there could be some transient problem that causes this. I'd consider it a bug that we don't right now, given that the specification mentions we should retry on transient errors and there is potential for such an error to be transient.
Timeouts:
- Example: as given in the issue description
- I'm conflicted about these - the timeout we pass in right now is the timeout for the entire export, including any retries. So if a request times out here, there's no time left to retry the export anyway (see the sketch after this list). We'd have to add a timeoutUntilRetry option to allow end-users to set that specifically.
- OTel Python handles it the same way as we do: https://github.com/open-telemetry/opentelemetry-python/blob/5ddb8e7407a06b560c5483f7649913e170229936/exporter/opentelemetry-exporter-otlp-proto-http/src/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py#L174-L208
- OTel Java seems to take a similar approach: https://github.com/open-telemetry/opentelemetry-java/blob/2c0e6c49860035281817b4ee5f5978a9bfbaf6d5/exporters/sender/okhttp/src/main/java/io/opentelemetry/exporter/sender/okhttp/internal/OkHttpHttpSender.java#L85-L89 - it does have a separate connect timeout, but that functionality does not exist in the Node.js http module AFAIK; its timeout makes no distinction between the connect and read/write phases.
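To make the budget constraint concrete, retrying within the existing timeout would have to look roughly like this (a sketch only, not actual SDK code - exportWithDeadline and its parameters are made up for illustration):

// Every attempt shares one deadline derived from timeoutMillis, so an
// attempt that times out has usually consumed the whole budget and leaves
// nothing for a retry - hence the idea of a separate per-attempt value.
async function exportWithDeadline(
  sendOnce: (attemptTimeoutMs: number) => Promise<void>,
  timeoutMillis: number,
): Promise<void> {
  const deadline = Date.now() + timeoutMillis;
  let backoffMs = 100;

  for (;;) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) throw new Error("export deadline exceeded");
    try {
      return await sendOnce(remaining); // attempt capped by remaining budget
    } catch {
      const waitMs = Math.min(backoffMs, deadline - Date.now());
      if (waitMs <= 0) throw new Error("export deadline exceeded");
      await new Promise((resolve) => setTimeout(resolve, waitMs));
      backoffMs *= 2; // exponential backoff between attempts
    }
  }
}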
Handling of failed batches:
- Example: if an export fails due to the above reasons, the BatchSpanProcessor will drop the spans.
- I think this is intended behavior. I did check the OTel specification and it does not specify what to do in that case. Other SDK implementations also seem to drop spans on failed exports.
A side-note on timeout handling in bun: I see that the example here is targeting bun. There was a bug in bun that we hit where the http implementation behaved differently depending on whether a user was running node or bun - the timeout would always trigger immediately (#5260). We have since solved it by dropping support for old Node.js runtimes, so if you're seeing similar behavior using bun, this should be fixed in the latest version of the exporter.