opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

[consumererror] Add OTLP-centric error type

Open evan-bradley opened this issue 7 months ago • 5 comments

Description

Continuation of https://github.com/open-telemetry/opentelemetry-collector/pull/11085.

Link to tracking issue

Fixes https://github.com/open-telemetry/opentelemetry-collector/issues/7047

evan-bradley avatar May 15 '25 21:05 evan-bradley

Codecov Report

Attention: Patch coverage is 92.85714% with 8 lines in your changes missing coverage. Please review.

Project coverage is 91.57%. Comparing base (1046576) to head (20af93c). Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...sumererror/internal/statusconversion/conversion.go 80.00% 8 Missing :warning:

:x: Your patch status has failed because the patch coverage (92.85%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13042      +/-   ##
==========================================
+ Coverage   91.55%   91.57%   +0.01%     
==========================================
  Files         526      528       +2     
  Lines       29365    29474     +109     
==========================================
+ Hits        26886    26991     +105     
- Misses       1953     1958       +5     
+ Partials      526      525       -1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar May 15 '25 21:05 codecov[bot]

I'll look at improving the code coverage tomorrow. In the meantime, this should be in a pretty good state.

evan-bradley avatar May 15 '25 23:05 evan-bradley

The remaining functions missing test coverage are the status code conversion functions, which are pretty direct. I don't think tests are very helpful since the functions are pretty direct mappings. The only thing I can think of that would meaningfully improve coverage is to store the mappings in a map object as opposed to in a switch statement, but feels like a slightly worse implementation.

evan-bradley avatar May 16 '25 18:05 evan-bradley

Since this was specially controversial last time, I suggest we wait either until we have more approvals (I suggest 4) or some time has passed (I would suggest Friday next week).

cc @open-telemetry/collector-approvers

mx-psi avatar May 21 '25 15:05 mx-psi

There is a big problem we identified in the past, which is that the default behavior of the errors in the collector pipelines is that they are retryable. It seems that this PR changes that, which I 1000% support, but we need to make sure we document this change and analyze the impact of that.

Agreed. That will come in a follow-up once we start to use this, I will make sure we proceed discerningly here.

evan-bradley avatar Jun 02 '25 12:06 evan-bradley

@bogdandrutu @jmacd Thanks for your questions, they've helped me challenge some assumptions in the implementation.

Before I make code changes (though I would be happy to do so if you want me to illustrate any points), I want to propose how we handle the cases you've asked about. I want to try to answer your questions in a single comment instead of individually.

  1. In the case where origErr does not contain any nested information conflicting with the status code, use the caller-provided status code. I think this is straightforward and agreed upon.
  2. In the case where origErr does contain conflicting information (e.g. a nested status.Status) assume the caller has the most relevant context and use the provided HTTP/gRPC status code.
  3. When origErr is nil, we assume the caller does not have or doesn't want to include any underlying error. In this case, we return the most we can from the Error method and nil from Unwrap.
  4. When the user passes an HTTP or gRPC success code, we assume this is a qualified success and don't take any action. Our OTLP exporters are the only places where we explicitly handle the case that @jmacd pointed out, so I don't think we would save much complexity by avoiding this. gRPC also handles this slightly differently and assumes an "OK" status is actually the result of erroneous code, and remaps the status in the resulting *status.Status to be codes.Unknown. Additionally, the OTLP specification explicitly calls out that partial success responses have a success response code, so I think there's some validity to not explicitly overriding this.

Let me know if you have an issue with points 1-3, though I think they should be fairly uncontroversial. I think the situation with the most nuance is point 4, to which I can think of a few approaches to take:

  1. Do nothing and assume this is a qualified success. I don't want to discuss exactly what that will look like in this PR, and until we determine a way to concretely represent that, I think we should advise that these errors aren't created with success response codes.
  2. Map success codes to nil error values to override the fact that they aren't errors. If we make a separate error type for representing partial success responses or other qualified successes, I think this will likely be okay.
  3. Map success codes to codes.Unknown like gRPC does. I think I like this option the least of the three.

My proposal would be approach 1 until we determine how to proceed with partial success responses. We can decide whether to switch to approach 2 at that time.

evan-bradley avatar Jun 23 '25 20:06 evan-bradley

Do nothing and assume this is a qualified success.

Sounds good and conservative. I'd say it gives room to improve in the future, though nothing's easy.

jmacd avatar Jun 30 '25 19:06 jmacd

In the case where origErr does contain conflicting information (e.g. a nested status.Status) assume the caller has the most relevant context and use the provided HTTP/gRPC status code.

if the origErr is grpc.Status and we call with a new HTTP status, are you also planning to remove the grpc.Status from the error chain? Otherwise the new generated error may be in the same time both and usages of "error.Is" may break or get confused.

When the user passes an HTTP or gRPC success code, we assume this is a qualified success and don't take any action.

To make sure I understand, does this mean you return nil from NewOTLPHTTPError(nil, 200)?

bogdandrutu avatar Jul 01 '25 17:07 bogdandrutu

if the origErr is grpc.Status and we call with a new HTTP status, are you also planning to remove the grpc.Status from the error chain? Otherwise the new generated error may be in the same time both and usages of "error.Is" may break or get confused.

Just to verify, did you mean errors.As? I think errors.Is will always return false since we aren't working with instantiated errors.

As for using errors.As, if I understand you correctly, even if the error is both, I expect callers will check for consumererror.Error first and status.Status second, in which case there shouldn't be any breakage or confusion.

To make sure I understand, does this mean you return nil from NewOTLPHTTPError(nil, 200)?

This will still return an error. We need to develop this further, but the goal for now is that it is up to the caller to determine whether they want to return an error or not, and we don't make any assumptions about their intent. Maybe "qualified success" isn't the right term for now since nowhere else in the Collector understands how to handle them.

evan-bradley avatar Jul 01 '25 17:07 evan-bradley

As for using errors.As, if I understand you correctly, even if the error is both, I expect callers will check for consumererror.Error first and status.Status second, in which case there shouldn't be any breakage or confusion.

This is not what I see today in code, see OTLP receiver.

bogdandrutu avatar Jul 01 '25 18:07 bogdandrutu

This will still return an error. We need to develop this further, but the goal for now is that it is up to the caller to determine whether they want to return an error or not, and we don't make any assumptions about their intent. Maybe "qualified success" isn't the right term for now since nowhere else in the Collector understands how to handle them.

This may break existing code which will return an error and caller may retry and cause lots of duplicate data, possible infinite retries, etc.

bogdandrutu avatar Jul 01 '25 18:07 bogdandrutu

This is not what I see today in code, see OTLP receiver.

The goal is that we will supplement or replace that code using this error type; a major motivating factor for this new error type is that translating the gRPC status code into an HTTP status code is currently a lossy operation.

This may break existing code which will return an error and caller may retry and cause lots of duplicate data, possible infinite retries, etc.

I think that will only occur if exporters make significant changes to the way they handle errors. For example, in the OTLP/HTTP exporter, this line will go from:

return formattedErr

to

return consumererror.NewOTLPHTTPError(formattedErr, resp.StatusCode)

In this case, we've already validated that resp.StatusCode is not 200. I still don't expect anyone to go out of their way to use a 200 error code here, but I don't want to be prescriptive until we've decided on how we want to handle qualified successes.

evan-bradley avatar Jul 01 '25 18:07 evan-bradley

I still don't expect anyone to go out of their way to use a 200 error code here, but I don't want to be prescriptive until we've decided on how we want to handle qualified successes.

Then let's make it impossible to call with 200 (success) and forbid that, so we can in the future change it the way we want that behavior.

bogdandrutu avatar Jul 02 '25 06:07 bogdandrutu

Then let's make it impossible to call with 200 (success) and forbid that, so we can in the future change it the way we want that behavior.

I like the idea of forbidding it so we can make future changes not breaking, but in a "new error" function that feels challenging. I see three options here:

  1. Make the functions panic, like we did here. I don't like the idea of panicking, especially at runtime, but this guarantees us that callers will check for and avoid calls using success codes.
  2. Return nil. This has the upside of being a "no-op" in case callers blindly use a 200 code, but also silently masks what I would consider a bug, and will be a breaking change if we decide to return an error in the future.
  3. Have the New[...]Error functions return (error, error). This is how we usually handle these situations, but in this case that feels odd.

A side note, for the implementation, I will do this for codes.Ok for gRPC (the behavior of the OTLP exporter) and [200, 299] for HTTP, since that's the behavior of the OTLP/HTTP exporter.

evan-bradley avatar Jul 02 '25 12:07 evan-bradley

I am ok to panic in this and clearly document it.

sfc-gh-bdrutu avatar Jul 03 '25 18:07 sfc-gh-bdrutu