opentelemetry-collector
opentelemetry-collector copied to clipboard
[component] Move component status features to its own module
Description
This PR is an attempt are removing component status reporting features from the component module. The technique was to make a new componentstatus.StatusReporter interface that component.Host could optionally implement. This has impacts:
- We no longer need a
servicetelemetry.TelemetrySettingsstruct. - Components no longer get access to a
ReportStatusfunction until thecomponent.Startmethod provides the host. - Components that use sharedcomponent will now only report a status for instances that have been started when an error in start occurs. As an example if an otlpreceiver is used in a traces, metrics, and logs pipeline, when the host make the initial call to
Start, say for the traces pipeline, sharedcomponent will only report the error returned fromStartfor the traces instance, and then continue on to shutdown.
If this solution's general idea is tenable, then I believe it could be broken down into slightly smaller PRs, such as:
- Remove the need for
servicetelemetry.TelemetrySettings: https://github.com/open-telemetry/opentelemetry-collector/pull/10728 - Move status stuff into its own module: https://github.com/open-telemetry/opentelemetry-collector/pull/10730
- Remove ReportStatus from TelemetrySettings and implement the
StatusReporterinterface: https://github.com/open-telemetry/opentelemetry-collector/pull/10777
Link to tracking issue
Related to https://github.com/open-telemetry/opentelemetry-collector/pull/10413
Codecov Report
Attention: Patch coverage is 81.27660% with 44 lines in your changes missing coverage. Please review.
Project coverage is 92.12%. Comparing base (
95902c1) to head (900cebe). Report is 1 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #10725 +/- ##
==========================================
- Coverage 92.18% 92.12% -0.06%
==========================================
Files 403 400 -3
Lines 18792 18816 +24
==========================================
+ Hits 17323 17335 +12
- Misses 1109 1121 +12
Partials 360 360
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
One other note - component.InstanceID was added for the sake of component status (though in theory it may be useful otherwise). I think we may want to move it out of component as well.
@djaglowski I've moved InstanceID to componentstatus (which in the actually implementation I'll handle by adding it when the content is moved to its own package and deprecating the one in component).
@mx-psi the fact that this move is possible makes me think maybe https://github.com/open-telemetry/opentelemetry-collector/pull/10222 is back on the table, but I need to think about it more.
@mwear the key change happens when a component.Start returns a synchronous error, which gets back to the host and ends up terminating the startup process. The other signals for the sharedcomponent won't receive the NewPermanent error, but also they never receiver a Starting status either. They get not statuses reported for them, like another component further down the line that also did not get started.
What about the recoverable error case? When reporting status from within the component (I guess this is via host now), it should report status for all of the components it represents. All components represented by the shared component must be in the Starting state before the API can be from called from "within" the shared component. If you report RecoverableError during start, but some instances are not Starting, you will get an invalid state transition for the unstarted instances, and those instances will be stuck reporting an invalid status. I'm not sure if this makes sense, but this is my concern with this change. It's a little hard to test this end to end with the health check extension to verify because there are so many changes. In any case, if these words make any sense, and are accurate, we may have to handle this another way.
It's possible that the way you are handling previous events solves my concerns. I'd still like to find a way to verify this end to end.
@mwear I'll see if I can add a test case in this PR with that example.
@mwear I've added a very rough e2e test in internal/e2e/status_test.go that tests that a component that is using sharedcomponent does end up eventually reporting all statuses for each instance and that an extension watching for statuses does receive all the events.
This gives me confidence that the statuses do end up in the extensions correctly. How the extension handles the statuses is up to the extension right?
One thing of note is that the existing (in main) graph/sharedinstance implementation runs into an invalid state in the graph's fsm because the graph reports StatusStarting for the current pipeline and the sharedinstance implementation also reports StatusStarting for all instances, one of which was already reported as StatusStarting.
I highlight this as a datapoint that shows that status reporting still needs lots of maturing, and while this PR changes a lot about how status reporting works, I think that is ok since it is still so young. I feel confident that if we do find issues with this implementation later on that we'll be able to make fixes via graph/sharedinstance/componentstatus without needing to break user APIs or component behavior.
Thanks for the test @TylerHelmuth, this gives me more confidence. I'm aware of the repeat status reporting that was part of the original implementation. The finite state machine is meant to protect and enforce the lifecycle of a component, and invalid state transitions are not necessarily errors, in many cases should be considered a no-op. I know there is some logging around this that I would be in favor or removing because it gives the impression something unexpected happened, when that's not exactly the case. This is discussed in the status reporting documentation under the runtime section.
During runtime a component should not have to keep track of its state. A component should report status as operations succeed or fail and the finite state machine will handle the rest. Changes in status will result in new status events being emitted. Repeat reports of the same status will no-op. Similarly, attempts to make an invalid state transition, such as PermanentError to OK, will have no effect.
@mwear do you feel comfortable enough to move forward with this approach via https://github.com/open-telemetry/opentelemetry-collector/pull/10730?
My biggest reservation is that we need the collector to be observable for it to be 1.0. Part of this means having a functioning health check extension. The current replacement requires component status, so I don't think we can call the collector 1.0 without having component status enabled by default. If we are moving this code for organization purposes, that's fine, but I don't think we can skip this work for 1.0 altogether.
Part of this means having a functioning health check extension
The healthcheck extension is not in our GA roadmap. I think it's important that we keep working on it, but our current plans don't include this for 1.0 as defined in the GA roadmap
Ive created https://github.com/open-telemetry/opentelemetry-collector/pull/10777 to complete the transition in core.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.