opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

[component] Move component status features to its own module

Open TylerHelmuth opened this issue 1 year ago • 14 comments

Description

This PR is an attempt are removing component status reporting features from the component module. The technique was to make a new componentstatus.StatusReporter interface that component.Host could optionally implement. This has impacts:

  1. We no longer need a servicetelemetry.TelemetrySettings struct.
  2. Components no longer get access to a ReportStatus function until the component.Start method provides the host.
  3. Components that use sharedcomponent will now only report a status for instances that have been started when an error in start occurs. As an example if an otlpreceiver is used in a traces, metrics, and logs pipeline, when the host make the initial call to Start, say for the traces pipeline, sharedcomponent will only report the error returned from Start for the traces instance, and then continue on to shutdown.

If this solution's general idea is tenable, then I believe it could be broken down into slightly smaller PRs, such as:

  1. Remove the need for servicetelemetry.TelemetrySettings: https://github.com/open-telemetry/opentelemetry-collector/pull/10728
  2. Move status stuff into its own module: https://github.com/open-telemetry/opentelemetry-collector/pull/10730
  3. Remove ReportStatus from TelemetrySettings and implement the StatusReporter interface: https://github.com/open-telemetry/opentelemetry-collector/pull/10777

Link to tracking issue

Related to https://github.com/open-telemetry/opentelemetry-collector/pull/10413

TylerHelmuth avatar Jul 25 '24 18:07 TylerHelmuth

Codecov Report

Attention: Patch coverage is 81.27660% with 44 lines in your changes missing coverage. Please review.

Project coverage is 92.12%. Comparing base (95902c1) to head (900cebe). Report is 1 commits behind head on main.

Files Patch % Lines
service/internal/graph/graph.go 77.41% 21 Missing :warning:
service/service.go 69.56% 0 Missing and 7 partials :warning:
receiver/otlpreceiver/otlp.go 0.00% 6 Missing :warning:
internal/sharedcomponent/sharedcomponent.go 87.87% 4 Missing :warning:
extension/zpagesextension/zpagesextension.go 0.00% 3 Missing :warning:
processor/processortest/unhealthy_processor.go 50.00% 1 Missing and 1 partial :warning:
service/internal/status/status.go 97.05% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10725      +/-   ##
==========================================
- Coverage   92.18%   92.12%   -0.06%     
==========================================
  Files         403      400       -3     
  Lines       18792    18816      +24     
==========================================
+ Hits        17323    17335      +12     
- Misses       1109     1121      +12     
  Partials      360      360              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jul 25 '24 19:07 codecov[bot]

One other note - component.InstanceID was added for the sake of component status (though in theory it may be useful otherwise). I think we may want to move it out of component as well.

djaglowski avatar Jul 26 '24 15:07 djaglowski

@djaglowski I've moved InstanceID to componentstatus (which in the actually implementation I'll handle by adding it when the content is moved to its own package and deprecating the one in component).

@mx-psi the fact that this move is possible makes me think maybe https://github.com/open-telemetry/opentelemetry-collector/pull/10222 is back on the table, but I need to think about it more.

TylerHelmuth avatar Jul 26 '24 16:07 TylerHelmuth

@mwear the key change happens when a component.Start returns a synchronous error, which gets back to the host and ends up terminating the startup process. The other signals for the sharedcomponent won't receive the NewPermanent error, but also they never receiver a Starting status either. They get not statuses reported for them, like another component further down the line that also did not get started.

TylerHelmuth avatar Jul 26 '24 21:07 TylerHelmuth

What about the recoverable error case? When reporting status from within the component (I guess this is via host now), it should report status for all of the components it represents. All components represented by the shared component must be in the Starting state before the API can be from called from "within" the shared component. If you report RecoverableError during start, but some instances are not Starting, you will get an invalid state transition for the unstarted instances, and those instances will be stuck reporting an invalid status. I'm not sure if this makes sense, but this is my concern with this change. It's a little hard to test this end to end with the health check extension to verify because there are so many changes. In any case, if these words make any sense, and are accurate, we may have to handle this another way.

mwear avatar Jul 26 '24 21:07 mwear

It's possible that the way you are handling previous events solves my concerns. I'd still like to find a way to verify this end to end.

mwear avatar Jul 26 '24 21:07 mwear

@mwear I'll see if I can add a test case in this PR with that example.

TylerHelmuth avatar Jul 29 '24 14:07 TylerHelmuth

@mwear I've added a very rough e2e test in internal/e2e/status_test.go that tests that a component that is using sharedcomponent does end up eventually reporting all statuses for each instance and that an extension watching for statuses does receive all the events.

This gives me confidence that the statuses do end up in the extensions correctly. How the extension handles the statuses is up to the extension right?

One thing of note is that the existing (in main) graph/sharedinstance implementation runs into an invalid state in the graph's fsm because the graph reports StatusStarting for the current pipeline and the sharedinstance implementation also reports StatusStarting for all instances, one of which was already reported as StatusStarting.

I highlight this as a datapoint that shows that status reporting still needs lots of maturing, and while this PR changes a lot about how status reporting works, I think that is ok since it is still so young. I feel confident that if we do find issues with this implementation later on that we'll be able to make fixes via graph/sharedinstance/componentstatus without needing to break user APIs or component behavior.

TylerHelmuth avatar Jul 29 '24 17:07 TylerHelmuth

Thanks for the test @TylerHelmuth, this gives me more confidence. I'm aware of the repeat status reporting that was part of the original implementation. The finite state machine is meant to protect and enforce the lifecycle of a component, and invalid state transitions are not necessarily errors, in many cases should be considered a no-op. I know there is some logging around this that I would be in favor or removing because it gives the impression something unexpected happened, when that's not exactly the case. This is discussed in the status reporting documentation under the runtime section.

During runtime a component should not have to keep track of its state. A component should report status as operations succeed or fail and the finite state machine will handle the rest. Changes in status will result in new status events being emitted. Repeat reports of the same status will no-op. Similarly, attempts to make an invalid state transition, such as PermanentError to OK, will have no effect.

mwear avatar Jul 29 '24 20:07 mwear

@mwear do you feel comfortable enough to move forward with this approach via https://github.com/open-telemetry/opentelemetry-collector/pull/10730?

TylerHelmuth avatar Jul 29 '24 22:07 TylerHelmuth

My biggest reservation is that we need the collector to be observable for it to be 1.0. Part of this means having a functioning health check extension. The current replacement requires component status, so I don't think we can call the collector 1.0 without having component status enabled by default. If we are moving this code for organization purposes, that's fine, but I don't think we can skip this work for 1.0 altogether.

mwear avatar Jul 29 '24 22:07 mwear

Part of this means having a functioning health check extension

The healthcheck extension is not in our GA roadmap. I think it's important that we keep working on it, but our current plans don't include this for 1.0 as defined in the GA roadmap

mx-psi avatar Jul 30 '24 09:07 mx-psi

Ive created https://github.com/open-telemetry/opentelemetry-collector/pull/10777 to complete the transition in core.

TylerHelmuth avatar Jul 31 '24 21:07 TylerHelmuth

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Aug 15 '24 03:08 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Aug 29 '24 03:08 github-actions[bot]