opentelemetry-specification icon indicating copy to clipboard operation
opentelemetry-specification copied to clipboard

Clarify and improve the OTLP exporter requirements of a retry strategy

Open alanwest opened this issue 1 year ago • 5 comments

The OTLP exporter specification lacks clarity with regards to its retry strategy. I have reviewed a number of implementations across languages as well as the collector. From what I have seen, implementations deviate in various ways from one another.

The following statement is about as clear as the spec gets:

Transient errors MUST be handled with a retry strategy.

Beyond this, the spec descends into weak statements that lend themselves to differing interpretations and it even contradicts itself in a number of spots. Below is a list of areas where I believe improvements are necessary.

The retry strategy of the OTLP exporter is not specified

This has posed a number of issues:

  1. SDKs differ in their implementation of their retry strategy.
  2. Different vendors have expressed varying requirements with regards to retry.

I work for a vendor and the first point concerns me the most. When speaking with my end users, I cannot be confident that when my company's OTLP endpoint signals a request can be retried that this signal will be handled uniformly irrespective of language SDK used.

The second point is primarily what has blocked the .NET SIG from implementing any retry strategy at all. It may be necessary for multiple strategies to be specified allowing end users to opt in to the strategy that best meets their needs.

Transient error needs to be clearly defined

The spec lacks a clear definition for what is considered a transient error and therefore what MUST be retried by the OTLP exporter.

Response codes

The spec suggests that transient errors are defined, in part, by the response codes returned by the server.

Specifically with regards to HTTP status codes the exporter spec contradicts the protocol spec.

For example, my conclusion from the following statement is that an exporter MUST retry HTTP 408 and all 5xx response codes.

For OTLP/HTTP, the errors 408 (Request Timeout) and 5xx (Server Errors) are defined as transient

However, I have doubts this interpretation was intended as the statement is contradicted in at least two ways by the protocol spec which:

  1. Contains a table of a very limited set of retryable 4xx/5xx HTTP response codes.
  2. Downgrades the normative language defining a client's behavior from a MUST retry to a SHOULD retry.

Beyond response codes

The spec for OTLP/HTTP contains two more scenarios where a client should (must?) retry. It is unclear whether these scenarios are considered transient errors. If they are considered transient errors, then it seems these statements need to be MUST rather than SHOULD.

  1. If the client cannot connect to the server, the client SHOULD retry the connection using an exponential backoff strategy between retries

  2. If the server disconnects without returning a response, the client SHOULD retry and send the same request

The spec for OTLP/gRPC does not contain similar language. Should it?

Exponential backoff w/ jitter

There are multiple statements in the spec that indicate a client needs to employ an exponential backoff w/ jitter when retrying. I have two observations with regards to this requirement.

The requirement from the exporter spec is stronger than that of the protocol spec

For example, in the following two statements, one uses MUST and the other SHOULD.

  1. This retry strategy MUST implement an exponential back-off with jitter to avoid overwhelming the destination until the network is restored or the destination has recovered

  2. When retrying, the client SHOULD implement an exponential backoff strategy

Is it intentional that the exporter spec has a stronger requirement than the protocol spec?

The implementation of the exponential backoff strategy is not specified

Perhaps leaving the implementation unspecified was intentional. However, for the benefit of consistency across SDKs, I think it would beneficial to specify an implementation. I do not think this needs to be complicated. The gRPC retry spec offers some simple prior art.

When should the OTLP exporter stop retrying?

It is not clearly specified if and when the OTLP exporter should give up retrying.

It might be assumed that the exporter should continue to retry up until its configured timeout. However, if this is the case, then the spec needs to indicate this.

Another option could be to specify a maximum number of retry attempts. The Java implementation has done this, for example.

alanwest avatar Aug 03 '23 00:08 alanwest