opentelemetry-collector
opentelemetry-collector copied to clipboard
status reporting
Description: I started a POC for status reporting in https://github.com/open-telemetry/opentelemetry-collector/pull/5158. That PR served as a good starting point for discussion and feedback, but the implementation has changed direction enough, that I decided to close it in favor of this PR.
As a brief recap, this work was spun out of a discussion in the contrib repo around improving startup behavior, and error reporting at startup. See:
- https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/8816
- https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/8817
A proposal was suggested in a contrib issue: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/8816#issuecomment-1083380460
The proposal is then to change the component.Host to allow components to report its state (and continue reporting it), adding another function allowing components to register for callbacks, to be called whenever there's a report. This way, components such as the load balancing exporter and the health check extension can register for callbacks, whereas scrapers and components like the OTLP exporter can report their connection failures in real time.
This PR attempts to provide a status reporting mechanism as described in the above comment. To that end, the Host interface was extended with a ReportStatus function that components can use to report ok, or error statuses, and a RegisterStatusListener function, where interested components (i.e. the healthcheck extension, etc) can register callbacks to receive status notifications.
In discussions on #5158 and at the SIG meeting, it was suggested that the same mechanism could replace the PipelineWatcher interface. RegisterStatusListener also allows optional registration of PipelineReady and PipelineNotReady callbacks. I have not removed the PipelineWatcher interface in this PR, but can (and will) remove it in a follow up PR when (and if) we decide this is the direction we want to head in.
Another suggestion was that this system could also replace the Host.ReportFatalError function. If Host.ReportStatus is called with a status.Event of type status.FatalError it will behave the same way as Host.ReportFatalError does today.
Another requirement is that this mechanism should work with both full and partial restarts of the service. The idea is that components will register status listeners during startup, and unregister during shutdown. So long as this convention is followed, we should always have active components reporting and listening for status notifications. We might need to discuss and adjust what pipeline ready, or not ready means in the context of a partial restart, but I think we can come up with a reasonable way to accommodate that scenario.
~Lastly, we probably should discuss the new error types introduced in this PR. This PR adds recoverable, permanent, and fatal to the consumererror package. The use cases for recoverable and fatal are pretty self explanatory. A fatal error will cause the collector to shutdown, whereas a recoverable error, is likely transient and could resolve over time. The permanent case, is somewhat less obvious, and it should be noted that the consumererror package has the same error type. If we keep it, we should try to find a common place for the error to reside. I could see a permanent error resulting from a component retrying recoverable errors for a time period. Once it has exhausted the retry strategy, it might choose to shut itself down and return a permanent error to indicate early shutdown of a component. I'm not sure if we actually have this use case today, or whether it's imaginary at this point.~
Basic usage:
Register and Unregister
// All callbacks are optional. A listener can register as many, or few as desired. Some listeners
// might care about status events only, other might only be interested in pipeline ready and / or
// not ready.
func (c *mycomponent) start(_ context.Context, host component.host) {
c.unregisterFunc := host.RegisterStatusListener(
component.WithStatusEventHandler(statusEventHandler),
component.WithPipelineReadyHandler(readyHandler),
component.WithPipelineNotReadyHandler(notReadyHandler),
)
}
func (c *mycomponent) statusEventHandler(event component.StatusEvent) {
switch event.Type {
case component.StatusOK:
// component is ok (or has recovered)
case component.StatusRecoverableError:
//handle recoverable
case component.StatusPermanentError:
//handle permanent
case component.StatusFatalError:
//handle fatal, collector will shutdown after listener callbacks are executed
default:
//handle unset
}
}
func (c *mycomponent) shutdown(_ context.Context) {
// call the unregisterFunc during shutdown, or when appropriate
err := c.unregisterFunc()
}
ReportStatus
// recoverable error
if err := s.connect(); err != nil {
ev, _ := componet.NewStatusEvent(component.StatusRecoverableError, component.WithError(err))
host.ReportStatus(ev)
}
// permanent error
if err := s.connect(); err != nil {
ev, _ := component.NewStatusEvent(component.StatusPermanentError, component.WithError(err))
host.ReportStatus(ev)
}
// fatal error
if err := s.superImportant(); err != nil {
ev, _ := component.NewStatusEvent(component.FatalError, component.WithError(err))
host.ReportStatus(ev)
}
// everything is fine
if err := s.connect(); err == nil {
ev, _ := component.NewStatusEvent(component.StatusOK)
host.ReportStatus(ev)
}
// status.OK and status.WithError not valid
if ev, err := component.NewStatusEvent(component.StatusOK, component.WithError(err)); err != nil {
logger.Warn("error provided with ok status")
}
Summary
To summarize the above, this PR:
- introduces a status reporting mechanism for components to report status, and interested components to register callbacks to listen for status
- provides a potential replacement for the
PipelineWatcherinterface - provides a potential replacement for
Host.ReportFatalErrorfunction - ~introduces a
componenterrorpackage withfatal,recoverable, andpermanenterrors~
Link to tracking Issue: See:
- https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/8816#issuecomment-1083380460
- https://github.com/open-telemetry/opentelemetry-collector/pull/5158
Testing:
- Mainly unit testing
Documentation:
- Source code documentation
Codecov Report
Merging #5304 (d6bc9cc) into main (85c26ea) will increase coverage by
0.19%. The diff coverage is94.31%.
:exclamation: Current head d6bc9cc differs from pull request most recent head 90baf5f. Consider uploading reports for the commit 90baf5f to get more accurate results
@@ Coverage Diff @@
## main #5304 +/- ##
==========================================
+ Coverage 90.63% 90.83% +0.19%
==========================================
Files 190 192 +2
Lines 11435 11575 +140
==========================================
+ Hits 10364 10514 +150
+ Misses 851 837 -14
- Partials 220 224 +4
| Impacted Files | Coverage Δ | |
|---|---|---|
| service/internal/extensions/extensions.go | 81.81% <0.00%> (ø) |
|
| service/service.go | 42.16% <40.00%> (+0.37%) |
:arrow_up: |
| component/componenttest/nop_host.go | 100.00% <100.00%> (ø) |
|
| component/status/status.go | 100.00% <100.00%> (ø) |
|
| service/host.go | 100.00% <100.00%> (ø) |
|
| service/internal/builder/exporters_builder.go | 91.61% <100.00%> (+0.05%) |
:arrow_up: |
| service/internal/builder/pipelines_builder.go | 87.83% <100.00%> (+0.06%) |
:arrow_up: |
| service/internal/builder/receivers_builder.go | 82.09% <100.00%> (+0.11%) |
:arrow_up: |
| service/internal/components/host_wrapper.go | 100.00% <100.00%> (ø) |
|
| ...ervice/internal/components/status/notifications.go | 100.00% <100.00%> (ø) |
|
| ... and 14 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 85c26ea...90baf5f. Read the comment docs.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Is there any news about this one?
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
@bogdandrutu, do you see this PR being merged anytime soon? I'd rather close it, as this has been with no activity since March, with my message for a status request on Jul 4 going unanswered.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closing.