opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

[consumer] Add new otlp-centric error type

Open TylerHelmuth opened this issue 1 year ago • 7 comments

Description

Adds a new otlp-centric error type. This type improves the ability for otlp exporters to report errors to otlp consumers, while also giving any component the ability to properly interpret http/grpc/retriable errors.

With this type, the http/grpc/retriable datasets are always considered according to the OTLP specification. If a non-otlp component receives this error type, that can properly map the http/grpc/retryable error to their form based on OTLP's interpretation of the underlying data.

For example, if the error type stored HTTP status 404 a component checking for the http status code within the error would know that the 404 means Not Found according to OTLP. If the component needed to translate 404 Not Found to their own response code, they would be able to trust the meaning.

Link to tracking issue

Continuation of https://github.com/open-telemetry/opentelemetry-collector/pull/9041 Closes https://github.com/open-telemetry/opentelemetry-collector/issues/7047

Testing

Added tests

Documentation

Added godoc comments

TylerHelmuth avatar Sep 09 '24 17:09 TylerHelmuth

Codecov Report

Attention: Patch coverage is 72.97297% with 20 lines in your changes missing coverage. Please review.

Project coverage is 91.78%. Comparing base (59c083f) to head (5acc22d). Report is 558 commits behind head on main.

Files with missing lines Patch % Lines
consumer/consumererror/error.go 67.56% 10 Missing and 2 partials :warning:
...sumererror/internal/statusconversion/conversion.go 76.47% 8 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11085      +/-   ##
==========================================
- Coverage   91.85%   91.78%   -0.08%     
==========================================
  Files         416      418       +2     
  Lines       19925    19996      +71     
==========================================
+ Hits        18302    18353      +51     
- Misses       1245     1263      +18     
- Partials      378      380       +2     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Sep 09 '24 17:09 codecov[bot]

I wonder if this approach is becoming too specific to be useful. HTTP and gRPC status codes (and conversion between them) should be protocol agnostic. Is there any way we can design this so that we have a general consumer error that contains HTTP and gRPC status codes that can be extended to add protocol specific functionality as needed?

@mwear the discussion @evan-bradley @bogdandrutu @dmitryax and I had led us to the conclusion that a isolated http status or grpc status, with no knowledge of its source, led to ambiguity. Essentially a protocol-agnostic approach meant we could not be sure what the code meant.

For example, a non-OTLP exporter, call it exporter A, may use http protocol to make connections with some backend, but, similar to OTLP, it may be speaking a protocol on top of HTTP. That protocol could return a 500 http status code when it is trying to communicate to its clients that it was their fault. This would be in violation of http's definition of 500, but as long as the sever and client have agreed upon their protocol they'll be able to communicate effectively.

In the situation where exporter A wants to propagate this error back up the pipeline, recording a generic HTTP 500 status code will cause problems. In this scenario the 500 status code is representing a client error, but if that code made it back to a receiver than did not speak the same protocol as exporter A, such as the OTLP receiver, it will interpret the code incorrectly. In this scenario the OTLP receiver should return an error that indicates the client had a problem, likely a 400, but it would see the 500 and interpret it incorrectly as a server error.

Our solution to this was to force a protocol into the error and the one that made the most sense was OTLP. If we instead included an additional piece of metadata, such as WithProtocol, then all the receivers are left with the work to map from all the different possible protocols to their own. By building in the OTLP protocol, the work is now on the exporter, which only has to do a single mapping of its protocol to OTLP. Then all receivers only have to do a single mapping of OTLP to their protocol. And any component that speaks OTLP gets off easy.

If, in the initial scenario, there is a receiver A that speaks the same protocol, then exporter A could return its own special error type and receiver A could look for that error type. We aren't restricting what types of errors an exporter can send back up the pipeline, only providing a consistent, reliable container that we'll recommend using.

TylerHelmuth avatar Sep 16 '24 22:09 TylerHelmuth

@TylerHelmuth I would like to add to the list of considerations that there are a couple of existing places where I think we need a protocol-agnostic error status. Discussed in https://github.com/open-telemetry/opentelemetry-collector/issues/11183, there are situations where the exporterhelper's Timeout sender and Retry sender may (IMO) want to abort a request because the context deadline is shorter than the configured timeout or backoff interval. The root cause is that Golang's Context is protocol agnostic, and we have failures induced by the context which are not Canceled and not DeadlineExceeded. I'm looking for an Aborted status code. How would you propose to return protocol-agnostic errors from exporterhelper?

See https://github.com/open-telemetry/opentelemetry-collector/pull/11198.

Have we considered using the gRPC error code space, since it was designed as a universal solution, while preserving protocol-specific error values too? For this case discussed above, I would expect to use a codes.Aborted w/ a message--it's a gRPC error code but it's a protocol-agnostic error.

jmacd avatar Sep 17 '24 19:09 jmacd

@jmacd I may not be understanding the issue well enough, but the intention is that the exporter in question translate its grpc status to the OTLP equivalent grpc status and use NewOTLPGRPCError. That puts the burden of translation on the exporter, who is the most capable to understand both its own protocol and otlp, instead of the receivers.

Have we considered using the gRPC error code space

This is what we do today. It works for the most part, but fails as a transport mechanism for http since translating to and from http status codes and grpc status codes is lossy.

TylerHelmuth avatar Sep 19 '24 17:09 TylerHelmuth

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Oct 18 '24 03:10 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Nov 16 '24 03:11 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Dec 10 '24 03:12 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Dec 27 '24 03:12 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jan 12 '25 03:01 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Jan 28 '25 03:01 github-actions[bot]