opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

status reporting

Open mwear opened this issue 3 years ago • 8 comments

Description: I started a POC for status reporting in https://github.com/open-telemetry/opentelemetry-collector/pull/5158. That PR served as a good starting point for discussion and feedback, but the implementation has changed direction enough, that I decided to close it in favor of this PR.

As a brief recap, this work was spun out of a discussion in the contrib repo around improving startup behavior, and error reporting at startup. See:

  • https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/8816
  • https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/8817

A proposal was suggested in a contrib issue: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/8816#issuecomment-1083380460

The proposal is then to change the component.Host to allow components to report its state (and continue reporting it), adding another function allowing components to register for callbacks, to be called whenever there's a report. This way, components such as the load balancing exporter and the health check extension can register for callbacks, whereas scrapers and components like the OTLP exporter can report their connection failures in real time.

This PR attempts to provide a status reporting mechanism as described in the above comment. To that end, the Host interface was extended with a ReportStatus function that components can use to report ok, or error statuses, and a RegisterStatusListener function, where interested components (i.e. the healthcheck extension, etc) can register callbacks to receive status notifications.

In discussions on #5158 and at the SIG meeting, it was suggested that the same mechanism could replace the PipelineWatcher interface. RegisterStatusListener also allows optional registration of PipelineReady and PipelineNotReady callbacks. I have not removed the PipelineWatcher interface in this PR, but can (and will) remove it in a follow up PR when (and if) we decide this is the direction we want to head in.

Another suggestion was that this system could also replace the Host.ReportFatalError function. If Host.ReportStatus is called with a status.Event of type status.FatalError it will behave the same way as Host.ReportFatalError does today.

Another requirement is that this mechanism should work with both full and partial restarts of the service. The idea is that components will register status listeners during startup, and unregister during shutdown. So long as this convention is followed, we should always have active components reporting and listening for status notifications. We might need to discuss and adjust what pipeline ready, or not ready means in the context of a partial restart, but I think we can come up with a reasonable way to accommodate that scenario.

~Lastly, we probably should discuss the new error types introduced in this PR. This PR adds recoverable, permanent, and fatal to the consumererror package. The use cases for recoverable and fatal are pretty self explanatory. A fatal error will cause the collector to shutdown, whereas a recoverable error, is likely transient and could resolve over time. The permanent case, is somewhat less obvious, and it should be noted that the consumererror package has the same error type. If we keep it, we should try to find a common place for the error to reside. I could see a permanent error resulting from a component retrying recoverable errors for a time period. Once it has exhausted the retry strategy, it might choose to shut itself down and return a permanent error to indicate early shutdown of a component. I'm not sure if we actually have this use case today, or whether it's imaginary at this point.~

Basic usage:

Register and Unregister

// All callbacks are optional. A listener can register as many, or few as desired. Some listeners
// might care about status events only, other might only be interested in pipeline ready and / or
// not ready.
func (c *mycomponent) start(_ context.Context, host component.host) {
    c.unregisterFunc := host.RegisterStatusListener(
        component.WithStatusEventHandler(statusEventHandler),
        component.WithPipelineReadyHandler(readyHandler),
        component.WithPipelineNotReadyHandler(notReadyHandler),
    )
}

func (c *mycomponent) statusEventHandler(event component.StatusEvent) {
    switch event.Type {
        case component.StatusOK:
		// component is ok (or has recovered)
	case component.StatusRecoverableError:
		//handle recoverable
        case component.StatusPermanentError:
		//handle permanent
        case component.StatusFatalError:
		//handle fatal, collector will shutdown after listener callbacks are executed
	default:
		//handle unset
    }
}

func (c *mycomponent) shutdown(_ context.Context) {
    // call the unregisterFunc during shutdown, or when appropriate
    err := c.unregisterFunc()
}

ReportStatus

// recoverable error
if err := s.connect(); err != nil {
    ev, _ := componet.NewStatusEvent(component.StatusRecoverableError, component.WithError(err))
    host.ReportStatus(ev)
}

// permanent error
if err := s.connect(); err != nil {
    ev, _ := component.NewStatusEvent(component.StatusPermanentError, component.WithError(err))
    host.ReportStatus(ev)
}

// fatal error
if err := s.superImportant(); err != nil {
    ev, _ := component.NewStatusEvent(component.FatalError, component.WithError(err))
    host.ReportStatus(ev)
}

// everything is fine
if err := s.connect(); err == nil {
    ev, _ := component.NewStatusEvent(component.StatusOK)
    host.ReportStatus(ev)
}

// status.OK and status.WithError not valid
if ev, err := component.NewStatusEvent(component.StatusOK, component.WithError(err)); err != nil {
    logger.Warn("error provided with ok status")
}

Summary

To summarize the above, this PR:

  • introduces a status reporting mechanism for components to report status, and interested components to register callbacks to listen for status
  • provides a potential replacement for the PipelineWatcher interface
  • provides a potential replacement for Host.ReportFatalError function
  • ~introduces a componenterror package with fatal, recoverable, and permanent errors~

Link to tracking Issue: See:

  • https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/8816#issuecomment-1083380460
  • https://github.com/open-telemetry/opentelemetry-collector/pull/5158

Testing:

  • Mainly unit testing

Documentation:

  • Source code documentation

mwear avatar May 03 '22 02:05 mwear

Codecov Report

Merging #5304 (d6bc9cc) into main (85c26ea) will increase coverage by 0.19%. The diff coverage is 94.31%.

:exclamation: Current head d6bc9cc differs from pull request most recent head 90baf5f. Consider uploading reports for the commit 90baf5f to get more accurate results

@@            Coverage Diff             @@
##             main    #5304      +/-   ##
==========================================
+ Coverage   90.63%   90.83%   +0.19%     
==========================================
  Files         190      192       +2     
  Lines       11435    11575     +140     
==========================================
+ Hits        10364    10514     +150     
+ Misses        851      837      -14     
- Partials      220      224       +4     
Impacted Files Coverage Δ
service/internal/extensions/extensions.go 81.81% <0.00%> (ø)
service/service.go 42.16% <40.00%> (+0.37%) :arrow_up:
component/componenttest/nop_host.go 100.00% <100.00%> (ø)
component/status/status.go 100.00% <100.00%> (ø)
service/host.go 100.00% <100.00%> (ø)
service/internal/builder/exporters_builder.go 91.61% <100.00%> (+0.05%) :arrow_up:
service/internal/builder/pipelines_builder.go 87.83% <100.00%> (+0.06%) :arrow_up:
service/internal/builder/receivers_builder.go 82.09% <100.00%> (+0.11%) :arrow_up:
service/internal/components/host_wrapper.go 100.00% <100.00%> (ø)
...ervice/internal/components/status/notifications.go 100.00% <100.00%> (ø)
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 85c26ea...90baf5f. Read the comment docs.

codecov[bot] avatar May 03 '22 02:05 codecov[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jun 15 '22 03:06 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jun 30 '22 03:06 github-actions[bot]

Is there any news about this one?

jpkrohling avatar Jul 04 '22 19:07 jpkrohling

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jul 19 '22 03:07 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Aug 03 '22 03:08 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Aug 24 '22 03:08 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Sep 19 '22 04:09 github-actions[bot]

@bogdandrutu, do you see this PR being merged anytime soon? I'd rather close it, as this has been with no activity since March, with my message for a status request on Jul 4 going unanswered.

jpkrohling avatar Sep 27 '22 14:09 jpkrohling

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Oct 12 '22 03:10 github-actions[bot]

Closing.

jpkrohling avatar Oct 18 '22 19:10 jpkrohling