opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

Automatic status reporting for exporterhelper and core exporters

Open mwear opened this issue 1 year ago • 8 comments

Description: This PR adds a WithStatusReporting option to exporterhelper to opt-in to automatic status reporting Consume* calls and updates the core exporters to use it.

The behavior for Consume* is to report StatusOK when it returns without an error and StatusRecoverableError when it returns with an error. Users can report a more severe status (e.g. StatusPermanentError, StatusFatalError) from Consume from within the component, if they wish, and it will effectively override the automatic status reporting. The OTLP exporters demonstrate this usage by reporting permanent errors for certain statuses in this PR.

A component should report StatusPermanentError for error conditions that require human intervention to fix. Typically this is a configuration issue that is discovered at run time. A 401 (Unauthorized) status is an example of such an error. The collector config was structurally valid, but it has become clear at runtime that credentials are missing or invalid. Note: once a component transitions into StatusPermanentError it cannot make any further transitions. This is enforced by the finite state machine used by the underlying status reporting system.

The following HTTP status codes are treated as PermanentErrors for the OTLP/HTTP exporter:

  • http.StatusUnauthorized (401)
  • http.StatusForbidden (403)
  • http.StatusNotFound (404)
  • http.StatusMethodNotAllowed (405)
  • http.StatusRequestEntityTooLarge (413)
  • http.StatusRequestURITooLong (414)
  • http.StatusRequestHeaderFieldsTooLarge (431)

The following gRPC codes are treated as PermanentErrors for the OTLP/gRPC exporter:

  • codes.NotFound (5)
  • codes.PermissionDenied (7)
  • codes.Unauthenticated (16)

The choice of status codes we treat as permanent errors are negotiable.

Link to tracking Issue: #7682

Testing: Units, manual

Documentation: Comments

Edit I opened #8788 for comparison and as a manual alternative to this PR. Please compare this with #8788 and weigh in on what approach you prefer, and/or additional suggestions.

mwear avatar Oct 13 '23 20:10 mwear

Codecov Report

Attention: Patch coverage is 99.12281% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 91.61%. Comparing base (946dc24) to head (60d8eb9). Report is 292 commits behind head on main.

:exclamation: Current head 60d8eb9 differs from pull request most recent head ea04ada. Consider uploading reports for the commit ea04ada to get more accurate results

Files Patch % Lines
exporter/otlphttpexporter/otlp.go 95.45% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8684      +/-   ##
==========================================
+ Coverage   90.27%   91.61%   +1.34%     
==========================================
  Files         343      317      -26     
  Lines       18003    17253     -750     
==========================================
- Hits        16252    15807     -445     
+ Misses       1424     1151     -273     
+ Partials      327      295      -32     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 13 '23 21:10 codecov[bot]

~~I'm closing this for the same reason as mentioned here: https://github.com/open-telemetry/opentelemetry-collector/pull/8709#issuecomment-1779817651~~

mwear avatar Oct 25 '23 18:10 mwear

I closed this PR after our discussion on #8709 last week, but after revisiting the problem of reporting status for exporters, I think this approach is viable and worth further consideration. This is a limited version of #8709 that optionally wraps the Consume* and Start methods for exporters and will report status based on the returned error. Since an exporter sits at the end of a pipeline, the errors that it emits are related to the status of the exporter itself. By intercepting the errors we are able to reduce the amount of boilerplate code for reporting StatusOK and StatusRecoverableError leaving the component to handle more severe errors (e.g. StatusPermanentError).

mwear avatar Nov 01 '23 00:11 mwear

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Nov 16 '23 03:11 github-actions[bot]

We had a productive conversation about this at the SIG meeting this morning. One idea that we discussed was providing more information on the consumer errors that are passed up the pipeline. Specifically, the original status code. That would allow us to apply the same permanent error handling to all exporters via the wrapped consume methods.

Another point of discussion was the permanent errors themselves. When a component transitions to a permanent error state there is no way to clear that state or make any further transitions. The point of the permanent error state is to indicate that a component has encountered a situation that will require human intervention to resolve. Exactly what statuses fit this bill is a topic of debate. We could consider omitting permanent errors for now until we have a better handle on what we want to classify as permanent, and or consider some systems to make this a configurable aspect of error reporting.

mwear avatar Nov 29 '23 19:11 mwear

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Dec 14 '23 03:12 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jan 06 '24 03:01 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Feb 08 '24 03:02 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Mar 01 '24 03:03 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Mar 23 '24 03:03 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Apr 08 '24 03:04 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Apr 26 '24 03:04 github-actions[bot]