opentelemetry-specification
opentelemetry-specification copied to clipboard
Clarify and improve the OTLP exporter requirements of a retry strategy
The OTLP exporter specification lacks clarity with regards to its retry strategy. I have reviewed a number of implementations across languages as well as the collector. From what I have seen, implementations deviate in various ways from one another.
The following statement is about as clear as the spec gets:
Beyond this, the spec descends into weak statements that lend themselves to differing interpretations and it even contradicts itself in a number of spots. Below is a list of areas where I believe improvements are necessary.
The retry strategy of the OTLP exporter is not specified
This has posed a number of issues:
- SDKs differ in their implementation of their retry strategy.
- Different vendors have expressed varying requirements with regards to retry.
I work for a vendor and the first point concerns me the most. When speaking with my end users, I cannot be confident that when my company's OTLP endpoint signals a request can be retried that this signal will be handled uniformly irrespective of language SDK used.
The second point is primarily what has blocked the .NET SIG from implementing any retry strategy at all. It may be necessary for multiple strategies to be specified allowing end users to opt in to the strategy that best meets their needs.
Transient error needs to be clearly defined
The spec lacks a clear definition for what is considered a transient error and therefore what MUST be retried by the OTLP exporter.
Response codes
The spec suggests that transient errors are defined, in part, by the response codes returned by the server.
Specifically with regards to HTTP status codes the exporter spec contradicts the protocol spec.
For example, my conclusion from the following statement is that an exporter MUST retry HTTP 408 and all 5xx response codes.
For OTLP/HTTP, the errors
408 (Request Timeout)
and5xx (Server Errors)
are defined as transient
However, I have doubts this interpretation was intended as the statement is contradicted in at least two ways by the protocol spec which:
- Contains a table of a very limited set of retryable 4xx/5xx HTTP response codes.
- Downgrades the normative language defining a client's behavior from a MUST retry to a SHOULD retry.
Beyond response codes
The spec for OTLP/HTTP contains two more scenarios where a client should (must?) retry. It is unclear whether these scenarios are considered transient errors. If they are considered transient errors, then it seems these statements need to be MUST rather than SHOULD.
The spec for OTLP/gRPC does not contain similar language. Should it?
Exponential backoff w/ jitter
There are multiple statements in the spec that indicate a client needs to employ an exponential backoff w/ jitter when retrying. I have two observations with regards to this requirement.
The requirement from the exporter spec is stronger than that of the protocol spec
For example, in the following two statements, one uses MUST and the other SHOULD.
Is it intentional that the exporter spec has a stronger requirement than the protocol spec?
The implementation of the exponential backoff strategy is not specified
Perhaps leaving the implementation unspecified was intentional. However, for the benefit of consistency across SDKs, I think it would beneficial to specify an implementation. I do not think this needs to be complicated. The gRPC retry spec offers some simple prior art.
When should the OTLP exporter stop retrying?
It is not clearly specified if and when the OTLP exporter should give up retrying.
It might be assumed that the exporter should continue to retry up until its configured timeout. However, if this is the case, then the spec needs to indicate this.
Another option could be to specify a maximum number of retry attempts. The Java implementation has done this, for example.