opentelemetry-collector
opentelemetry-collector copied to clipboard
Automatic status reporting for exporterhelper and core exporters
Description:
This PR adds a WithStatusReporting
option to exporterhelper
to opt-in to automatic status reporting Consume*
calls and updates the core exporters to use it.
The behavior for Consume*
is to report StatusOK
when it returns without an error and StatusRecoverableError
when it returns with an error. Users can report a more severe status (e.g. StatusPermanentError
, StatusFatalError
) from Consume
from within the component, if they wish, and it will effectively override the automatic status reporting.
The OTLP exporters demonstrate this usage by reporting permanent errors for certain statuses in this PR.
A component should report StatusPermanentError
for error conditions that require human intervention to fix. Typically this is a configuration issue that is discovered at run time. A 401 (Unauthorized) status is an example of such an error. The collector config was structurally valid, but it has become clear at runtime that credentials are missing or invalid. Note: once a component transitions into StatusPermanentError
it cannot make any further transitions. This is enforced by the finite state machine used by the underlying status reporting system.
The following HTTP status codes are treated as PermanentError
s for the OTLP/HTTP exporter:
- http.StatusUnauthorized (401)
- http.StatusForbidden (403)
- http.StatusNotFound (404)
- http.StatusMethodNotAllowed (405)
- http.StatusRequestEntityTooLarge (413)
- http.StatusRequestURITooLong (414)
- http.StatusRequestHeaderFieldsTooLarge (431)
The following gRPC codes are treated as PermanentError
s for the OTLP/gRPC exporter:
- codes.NotFound (5)
- codes.PermissionDenied (7)
- codes.Unauthenticated (16)
The choice of status codes we treat as permanent errors are negotiable.
Link to tracking Issue: #7682
Testing: Units, manual
Documentation: Comments
Edit I opened #8788 for comparison and as a manual alternative to this PR. Please compare this with #8788 and weigh in on what approach you prefer, and/or additional suggestions.
Codecov Report
Attention: Patch coverage is 99.12281%
with 1 lines
in your changes are missing coverage. Please review.
Project coverage is 91.61%. Comparing base (
946dc24
) to head (60d8eb9
). Report is 292 commits behind head on main.
:exclamation: Current head 60d8eb9 differs from pull request most recent head ea04ada. Consider uploading reports for the commit ea04ada to get more accurate results
Files | Patch % | Lines |
---|---|---|
exporter/otlphttpexporter/otlp.go | 95.45% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #8684 +/- ##
==========================================
+ Coverage 90.27% 91.61% +1.34%
==========================================
Files 343 317 -26
Lines 18003 17253 -750
==========================================
- Hits 16252 15807 -445
+ Misses 1424 1151 -273
+ Partials 327 295 -32
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
~~I'm closing this for the same reason as mentioned here: https://github.com/open-telemetry/opentelemetry-collector/pull/8709#issuecomment-1779817651~~
I closed this PR after our discussion on #8709 last week, but after revisiting the problem of reporting status for exporters, I think this approach is viable and worth further consideration. This is a limited version of #8709 that optionally wraps the Consume*
and Start
methods for exporters and will report status based on the returned error. Since an exporter sits at the end of a pipeline, the errors that it emits are related to the status of the exporter itself. By intercepting the errors we are able to reduce the amount of boilerplate code for reporting StatusOK
and StatusRecoverableError
leaving the component to handle more severe errors (e.g. StatusPermanentError
).
This PR was marked stale due to lack of activity. It will be closed in 14 days.
We had a productive conversation about this at the SIG meeting this morning. One idea that we discussed was providing more information on the consumer errors that are passed up the pipeline. Specifically, the original status code. That would allow us to apply the same permanent error handling to all exporters via the wrapped consume methods.
Another point of discussion was the permanent errors themselves. When a component transitions to a permanent error state there is no way to clear that state or make any further transitions. The point of the permanent error state is to indicate that a component has encountered a situation that will require human intervention to resolve. Exactly what statuses fit this bill is a topic of debate. We could consider omitting permanent errors for now until we have a better handle on what we want to classify as permanent, and or consider some systems to make this a configurable aspect of error reporting.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.