opentelemetry-collector
[consumer] Allow annotating consumer errors with metadata
Description:
Revival of https://github.com/open-telemetry/opentelemetry-collector/pull/7439
This explores one possible way to allow adding metadata to errors returned from consumers. The goal here is to allow transmitting more data back up the pipeline if there is an error at some stage, with the goal of it being used by an upstream component, e.g. a component that will retry data, or a receiver that will propagate an error code back to the sender.
The current design eliminates the permanent/retryable error types in favor of a single error type that supports adding signal data to be retried. If no data is added to be retried, the error is considered permanent. Currently there is no distinction made between the signals for the sake of simplicity, the caller should know what signal is used when retrieving the retryable items from the error. Any options for retrying the data (e.g. a delay) are offered as options when adding data to retry.
The error type currently supports a few general metadata fields that are copied when a downstream error is wrapped:
- Partial successes can be expressed by setting the number of rejected items.
- gRPC and HTTP status codes can be set and translated between if necessary.
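To illustrate the second bullet, here is a sketch of the kind of status-code translation the error type could perform when only one transport's code was set. The function names are illustrative, not the PR's API; the mapping follows the commonly used gRPC-to-HTTP status conventions (e.g. `RESOURCE_EXHAUSTED` ↔ 429).

```go
package main

import "fmt"

// gRPC status code values, as defined in google.golang.org/grpc/codes.
const (
	codeUnknown           = 2
	codeInvalidArgument   = 3
	codeDeadlineExceeded  = 4
	codeResourceExhausted = 8
	codeUnimplemented     = 12
	codeUnavailable       = 14
)

// grpcToHTTP maps a gRPC status code to the conventional HTTP equivalent.
func grpcToHTTP(code int) int {
	switch code {
	case codeInvalidArgument:
		return 400
	case codeDeadlineExceeded:
		return 504
	case codeResourceExhausted:
		return 429
	case codeUnimplemented:
		return 501
	case codeUnavailable:
		return 503
	default:
		return 500
	}
}

// httpToGRPC is the reverse mapping for the codes handled above.
func httpToGRPC(code int) int {
	switch code {
	case 400:
		return codeInvalidArgument
	case 429:
		return codeResourceExhausted
	case 501:
		return codeUnimplemented
	case 503:
		return codeUnavailable
	case 504:
		return codeDeadlineExceeded
	default:
		return codeUnknown
	}
}

func main() {
	fmt.Println(grpcToHTTP(codeResourceExhausted)) // 429
	fmt.Println(httpToGRPC(503))                   // 14 (UNAVAILABLE)
}
```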
Link to tracking Issue:
Resolves #7047
cc @dmitryax
Happy to see this added. As discussed in https://github.com/open-telemetry/opentelemetry-collector/pull/9260, there is potential to propagate the information contained in PartialSuccess responses from OTLP exports back up the pipeline.
I worry about the code complexity introduced by having "success error" responses, meaning error != nil but the interpretation being success. However, this is what it will take to back-propagate partial successes: we want callers to see success, with metadata about the number of rejected points, where possible. Great to see this, thanks @evan-bradley.
As discussed in https://github.com/open-telemetry/oteps/pull/238, it would be useful for callers to have access to the underlying gRPC and/or HTTP error code so they can set the correct otel.outcome label. Thanks!
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Codecov Report
Attention: Patch coverage is 98.70130% with 1 line in your changes missing coverage. Please review.
Project coverage is 91.67%. Comparing base (ecbe02e) to head (1d1affc).
| Files | Patch % | Lines |
|---|---|---|
| consumer/consumererror/error.go | 98.03% | 1 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #9041      +/-   ##
==========================================
+ Coverage   91.64%   91.67%   +0.02%
==========================================
  Files         406      408       +2
  Lines       19001    19078      +77
==========================================
+ Hits        17414    17490      +76
- Misses       1227     1228       +1
  Partials      360      360
```
:umbrella: View full report in Codecov by Sentry.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: evan-bradley / name: Evan Bradley (0d5af43affea91f4eb8ec783e47f72e65ea53df9, c26e31a7067020df2030078b922486e3c0305fd6, 2888dbc8104d57773b4ebaf40ad956d5201077f9, 6f85be236071056368f862a0fea2b6ba5789838b, 9e27e93209f086cf9e3b358e94beccc6ec8b149d, 7ddbc0d45440c3f4cca1ca585ea8ef24ad4151ff, 2aa9e86d8665b2b903692fde669a6b1690e5b915, f4216c50400eeba85a245d47918be9df6113f574, 9fe7880d4ef1e97c67a35ada93d9f81141d5fc8b, 6a932c19f7b5306b8a83656cc59a48c90717fd73, a4f738d07aaa4ccdf1c682e3dfceafb9281efacb, a0b8bfdbe9875495ad254e3163f44eba2f71f267, 53cbc05d3430126385aa6d6f6b702ab54d23e91f, 3b80b56b2e981604a3199ba3340601521dd07d45, ea959b6864ba3f25ed15a9e961471efc6fefca46, d19438abddc958bfbd85b555352a5926e4edcfe2, dba92c6a624fb7439347072512844354c6a099d3, fe1802b73c03d64fa88cef892e74e493cabeae52, 3fb5426ec27dfc89fadc30d8c13e69f9a85c67cd, 05d6fa91fc25ed10b40cf39b4709c18fd051214d, 8c054433f3a99261a18fc18795ab9e9ebf86fe7f, d5f7a37c1f9997da592a610a980e7c47f1e4ccc1, a1ca2e11a5a92ea626a36c2fd83f90a3eb6c934a, c8c40ef1d2ae845c8d098f1cd40130347ad5dabb, 4447c00fc081ace1a1a536cc71a5e95d8c3270e0, 3aadc29c90ebc113766e20a688c82d64649cd863, e953920d3a2468035ef6e13ef9b3d52c38c125a7, ef40c0df54776c2df699932dfd015795994ec6e9, 7300bda61eceb6b9bc24719b1a27bc9bcf2d2923, b3cbde5439fa08c06bad4412cd274c139640ab6f, b55f661a34c45d64c0ced99bb7e49dab0ec7e9d9, fc40a8a051f401c179888ed6a079be91ed771dad, dd69f4a6648eb8f186c761e041fb15614bd687fc, 58704eca3f974a6a10f5095769775b6877306694, a3fb82d2d493b909bc876dc936134325dd5e1c9b, 12ffaba9d8993f3bac01195c6f2858fed95aaa2c, 1d1affc80d1a97e25ef26f3250ba825820b5fecf)
@codeboten Any idea what's up with the CLA? I used the "batch suggestions" feature to group the suggestions together. I can rebase the PR and redo those changes if I messed something up.
~~I will merge this by end of week unless there are further comments~~ (edit: no, see https://github.com/open-telemetry/opentelemetry-collector/pull/9041#issuecomment-2153754798)
I will merge this by end of week unless there are further comments
@mx-psi Thanks for keeping an eye on this. I need to make one more change to this to add "error source" metadata to the transport errors, and I also told @dmitryax I would wait for his review. I'd like to hold off until next week if that's okay.
Sure!
If we have the source of the error embedded, it can potentially be used by the fanout consumer during retries. For example, this info can be passed within the context to make the fanout consumer retry only the failed destinations.
@TylerHelmuth I looked into asserting that the TransportError type implements the interface used in status.FromError, and it turns out the interface definition is inlined in the function. We could still assert it with our own interface if you would like, but that ultimately feels ineffective to me.
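For context, the pattern under discussion looks roughly like the sketch below. grpc-go's `status.FromError` checks whether an error provides a `GRPCStatus()` method via an interface literal declared inside the function body, so there is no exported name to assert against; a locally defined copy of the interface is the workaround. The `Status` and `TransportError` types here are stand-ins (the real ones would be `google.golang.org/grpc/status.Status` and the collector's type), used so the sketch stays self-contained.

```go
package main

import "fmt"

// Status stands in for google.golang.org/grpc/status.Status.
type Status struct{ Code int }

// TransportError stands in for the collector's transport error type.
type TransportError struct {
	status *Status
	msg    string
}

func (e *TransportError) Error() string       { return e.msg }
func (e *TransportError) GRPCStatus() *Status { return e.status }

// Compile-time assertion against a locally defined copy of the interface
// that status.FromError inlines; this guarantees the method set matches
// even though the original interface has no exported name.
var _ interface{ GRPCStatus() *Status } = (*TransportError)(nil)

func main() {
	err := &TransportError{status: &Status{Code: 14}, msg: "upstream unavailable"}
	fmt.Println(err.GRPCStatus().Code)
}
```

The downside, as noted above, is that the local interface copy can silently drift from grpc-go's inlined one, which is why the assertion feels ineffective.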
@evan-bradley that's fine
An issue came up recently that would've been solved by the new TransportError type. Bumping this PR because I'd like to see it merged.
At the 2024-07-20 Collector Stability WG meeting we decided to mark this as a release blocker for v0.106.0.
Should we briefly define a transition plan in the doc?
I'm okay with that. I've put it in a section at the bottom for now.
@dmitryax @bogdandrutu This is ready for another round of reviews. The scope should be fairly minimal now, we can cover additional functionality in follow-ups.
I will be on PTO starting next week, so it would be nice if we could get this in this week.
My overall feedback is that we just want to add more "things". I don't want to block this, but I also feel we should make sure we add only the things we are 100% sure about.
Right now we're only adding HTTP/gRPC status code propagation, everything else is marked as unimplemented and still in the design stage. Do you have concerns with HTTP/gRPC status codes, or the way they're communicated?
I really like the MultiErr (current Error) approach and the APIs related to that. It is a nice way to encapsulate fanout errors.
~~I played with that, but the issue I encountered is that we may want to read these error types at the receiver level, when there may be multiple fanouts in the pipeline. So you could see something like the following:~~
~~I don't see how we pull out the joined errors without traversing the whole error tree by repeatedly calling Unwrap. Having a container type to keep track of all the errors just seems simpler.~~
I see now, I mistook what you said for the Uber multierr package. Glad you like it.
I am not sure if ErrorData is the right thing, though. Is this only for "Transport" errors? I would not return that bool, but instead have an enum or a way to say "IsHttp" or "IsGrpc", then have "AsHttp" or "AsGrpc" (names to be discussed).
This is meant to hold information related to anything the component would like to communicate upstream. Right now we're starting with HTTP and gRPC status codes, but we would like to add retry information, partial success information, etc. in the future. The goal is to start small and iterate as we design the additional options.
Can you please respond to the rest of the comments and see how the design of the "ErrorData" would look like?
Sure thing. The other questions were based on the first question, so I wanted to make sure we were on the same page there if possible.
I am not sure if ErrorData is the right thing, though. Is this only for "Transport" errors? I would not return that bool, but instead have an enum or a way to say "IsHttp" or "IsGrpc", then have "AsHttp" or "AsGrpc" (names to be discussed).
- Enum: We could do a list of enums, e.g. `[IS_RETRYABLE, IS_GRPC]`, then expose this with a method like `Types() []Type`. Since an error can contain multiple error types, this would have to be a list instead of just a single enum value.
- `IsHTTP` + `AsHTTP`: This essentially conveys the same information as `HTTPStatus() (code int, isHTTP bool)`, so from a functionality standpoint it's not an issue. I would rename `IsHTTP` to `HasHTTP`, though, since errors can contain multiple types of data. I'm not sure I see the benefit of retrieving this information across two methods, though; what's wrong with the boolean?
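The comma-ok accessor style discussed above can be sketched as follows. A single method returns both the code and whether HTTP data was ever set, mirroring Go's map-lookup and type-assertion idiom, instead of splitting the check (`HasHTTP`) and the retrieval (`AsHTTP`) across two methods. Names are illustrative, not the PR's final API.

```go
package main

import "fmt"

// Error is a minimal stand-in for the discussed error type; only the
// HTTP-status fields are sketched here.
type Error struct {
	httpStatus int
	hasHTTP    bool
}

// HTTPStatus returns the stored HTTP code; the second value reports
// whether one was set (comma-ok style).
func (e *Error) HTTPStatus() (int, bool) { return e.httpStatus, e.hasHTTP }

func main() {
	withCode := &Error{httpStatus: 429, hasHTTP: true}
	if code, ok := withCode.HTTPStatus(); ok {
		fmt.Println("HTTP status:", code)
	}

	withoutCode := &Error{}
	if _, ok := withoutCode.HTTPStatus(); !ok {
		fmt.Println("no HTTP status attached")
	}
}
```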
Before answering any of these questions, I need to better understand how chained consumers should "correctly" (recommended, or whatever the right term is) report errors. To know the "right" design, I need to better understand the usage. Here are some examples; we can debate what the correct behavior is for each and eventually document it. For these examples, assume logs:
First Example
A -> B (drops logs with invalid trace identifiers) -> C -> D -> E (exports to Loki)
For a request with 10 logs:
- B finds 2 logs that are not valid, for example because "trace.id" is 34 hex characters.
- E tries to export the logs, and 2 out of the 8 remaining (assuming B dropped the 2 invalid logs) are rejected with an HTTP 400 response.
Can you please tell me exactly which component should propagate/append errors and how?
Second Example
```
              /-> E -> F
A -> B -> C -> D
     \        \-> G -> H
      \
       \-> I -> J -> K
```
For a request with 10 logs:
- F fails to send 4 logs with an HTTP 429 response.
- H fails to send 2 logs with an HTTP 400 response.
- D drops 2 logs because of invalid trace identifiers ("trace.id" too large).
- K fails to send 1 log with the gRPC INVALID_ARGUMENT status code.
Can you please tell me exactly which component should propagate/append errors and how?