[consumererror] Add OTLP-centric error type
Description
Continuation of https://github.com/open-telemetry/opentelemetry-collector/pull/11085.
Link to tracking issue
Fixes https://github.com/open-telemetry/opentelemetry-collector/issues/7047
Codecov Report
Attention: Patch coverage is 92.85714% with 8 lines in your changes missing coverage. Please review.
Project coverage is 91.57%. Comparing base (
1046576) to head (20af93c). Report is 2 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...sumererror/internal/statusconversion/conversion.go | 80.00% | 8 Missing :warning: |
:x: Your patch status has failed because the patch coverage (92.85%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #13042 +/- ##
==========================================
+ Coverage 91.55% 91.57% +0.01%
==========================================
Files 526 528 +2
Lines 29365 29474 +109
==========================================
+ Hits 26886 26991 +105
- Misses 1953 1958 +5
+ Partials 526 525 -1
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
I'll look at improving the code coverage tomorrow. In the meantime, this should be in a pretty good state.
The remaining functions missing test coverage are the status code conversion functions, which are pretty direct. I don't think tests are very helpful since the functions are pretty direct mappings. The only thing I can think of that would meaningfully improve coverage is to store the mappings in a map object as opposed to in a switch statement, but feels like a slightly worse implementation.
Since this was specially controversial last time, I suggest we wait either until we have more approvals (I suggest 4) or some time has passed (I would suggest Friday next week).
cc @open-telemetry/collector-approvers
There is a big problem we identified in the past, which is that the default behavior of the errors in the collector pipelines is that they are retryable. It seems that this PR changes that, which I 1000% support, but we need to make sure we document this change and analyze the impact of that.
Agreed. That will come in a follow-up once we start to use this, I will make sure we proceed discerningly here.
@bogdandrutu @jmacd Thanks for your questions, they've helped me challenge some assumptions in the implementation.
Before I make code changes (though I would be happy to do so if you want me to illustrate any points), I want to propose how we handle the cases you've asked about. I want to try to answer your questions in a single comment instead of individually.
- In the case where
origErrdoes not contain any nested information conflicting with the status code, use the caller-provided status code. I think this is straightforward and agreed upon. - In the case where
origErrdoes contain conflicting information (e.g. a nestedstatus.Status) assume the caller has the most relevant context and use the provided HTTP/gRPC status code. - When
origErrisnil, we assume the caller does not have or doesn't want to include any underlying error. In this case, we return the most we can from theErrormethod andnilfromUnwrap. - When the user passes an HTTP or gRPC success code, we assume this is a qualified success and don't take any action. Our OTLP exporters are the only places where we explicitly handle the case that @jmacd pointed out, so I don't think we would save much complexity by avoiding this. gRPC also handles this slightly differently and assumes an "OK" status is actually the result of erroneous code, and remaps the status in the resulting
*status.Statusto becodes.Unknown. Additionally, the OTLP specification explicitly calls out that partial success responses have a success response code, so I think there's some validity to not explicitly overriding this.
Let me know if you have an issue with points 1-3, though I think they should be fairly uncontroversial. I think the situation with the most nuance is point 4, to which I can think of a few approaches to take:
- Do nothing and assume this is a qualified success. I don't want to discuss exactly what that will look like in this PR, and until we determine a way to concretely represent that, I think we should advise that these errors aren't created with success response codes.
- Map success codes to
nilerror values to override the fact that they aren't errors. If we make a separate error type for representing partial success responses or other qualified successes, I think this will likely be okay. - Map success codes to
codes.Unknownlike gRPC does. I think I like this option the least of the three.
My proposal would be approach 1 until we determine how to proceed with partial success responses. We can decide whether to switch to approach 2 at that time.
Do nothing and assume this is a qualified success.
Sounds good and conservative. I'd say it gives room to improve in the future, though nothing's easy.
In the case where origErr does contain conflicting information (e.g. a nested status.Status) assume the caller has the most relevant context and use the provided HTTP/gRPC status code.
if the origErr is grpc.Status and we call with a new HTTP status, are you also planning to remove the grpc.Status from the error chain? Otherwise the new generated error may be in the same time both and usages of "error.Is" may break or get confused.
When the user passes an HTTP or gRPC success code, we assume this is a qualified success and don't take any action.
To make sure I understand, does this mean you return nil from NewOTLPHTTPError(nil, 200)?
if the origErr is grpc.Status and we call with a new HTTP status, are you also planning to remove the grpc.Status from the error chain? Otherwise the new generated error may be in the same time both and usages of "error.Is" may break or get confused.
Just to verify, did you mean errors.As? I think errors.Is will always return false since we aren't working with instantiated errors.
As for using errors.As, if I understand you correctly, even if the error is both, I expect callers will check for consumererror.Error first and status.Status second, in which case there shouldn't be any breakage or confusion.
To make sure I understand, does this mean you return
nilfromNewOTLPHTTPError(nil, 200)?
This will still return an error. We need to develop this further, but the goal for now is that it is up to the caller to determine whether they want to return an error or not, and we don't make any assumptions about their intent. Maybe "qualified success" isn't the right term for now since nowhere else in the Collector understands how to handle them.
As for using errors.As, if I understand you correctly, even if the error is both, I expect callers will check for consumererror.Error first and status.Status second, in which case there shouldn't be any breakage or confusion.
This is not what I see today in code, see OTLP receiver.
This will still return an error. We need to develop this further, but the goal for now is that it is up to the caller to determine whether they want to return an error or not, and we don't make any assumptions about their intent. Maybe "qualified success" isn't the right term for now since nowhere else in the Collector understands how to handle them.
This may break existing code which will return an error and caller may retry and cause lots of duplicate data, possible infinite retries, etc.
This is not what I see today in code, see OTLP receiver.
The goal is that we will supplement or replace that code using this error type; a major motivating factor for this new error type is that translating the gRPC status code into an HTTP status code is currently a lossy operation.
This may break existing code which will return an error and caller may retry and cause lots of duplicate data, possible infinite retries, etc.
I think that will only occur if exporters make significant changes to the way they handle errors. For example, in the OTLP/HTTP exporter, this line will go from:
return formattedErr
to
return consumererror.NewOTLPHTTPError(formattedErr, resp.StatusCode)
In this case, we've already validated that resp.StatusCode is not 200. I still don't expect anyone to go out of their way to use a 200 error code here, but I don't want to be prescriptive until we've decided on how we want to handle qualified successes.
I still don't expect anyone to go out of their way to use a 200 error code here, but I don't want to be prescriptive until we've decided on how we want to handle qualified successes.
Then let's make it impossible to call with 200 (success) and forbid that, so we can in the future change it the way we want that behavior.
Then let's make it impossible to call with 200 (success) and forbid that, so we can in the future change it the way we want that behavior.
I like the idea of forbidding it so we can make future changes not breaking, but in a "new error" function that feels challenging. I see three options here:
- Make the functions panic, like we did here. I don't like the idea of panicking, especially at runtime, but this guarantees us that callers will check for and avoid calls using success codes.
- Return
nil. This has the upside of being a "no-op" in case callers blindly use a 200 code, but also silently masks what I would consider a bug, and will be a breaking change if we decide to return an error in the future. - Have the
New[...]Errorfunctions return(error, error). This is how we usually handle these situations, but in this case that feels odd.
A side note, for the implementation, I will do this for codes.Ok for gRPC (the behavior of the OTLP exporter) and [200, 299] for HTTP, since that's the behavior of the OTLP/HTTP exporter.
I am ok to panic in this and clearly document it.