[component] component.go - when is component.Start returning errors?
The `component.Start` function returns an error. When is an error expected to be returned upon starting? Configuration validation happens outside of the lifecycle of the component.
The error is handled as reporting a permanent error: https://github.com/open-telemetry/opentelemetry-collector/blob/c5a2c78d614321dcece254f7b79281b84127b5fc/service/internal/graph/graph.go#L396-L402
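For reference, the lifecycle contract in question is roughly the following (simplified excerpt; see the `component` package for the full definition):

```go
// Simplified excerpt of the component lifecycle contract from
// go.opentelemetry.io/collector/component. Host is the collector-provided
// interface a component can use to interact with the service at runtime.
type Component interface {
	// Start begins the component's work. The open question in this issue
	// is what a non-nil error returned here should mean.
	Start(ctx context.Context, host Host) error
	// Shutdown stops the component.
	Shutdown(ctx context.Context) error
}
```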
A few examples from contrib, encountered while working with lifecycle tests recently:
- A cassandra exporter starts, but cannot connect to Cassandra right away. It returns an error.
- An aws component starts, but cannot create an aws client, because environment variables are missing. It returns an error.
Typically, we see those when the components cannot pass lifecycle tests: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aissue+is%3Aopen+skip+lifecycle
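For illustration, the failing pattern looks roughly like this (a hypothetical cassandra-style exporter sketch, not the actual contrib code; `cassandraExporter` and `dialCluster` are made-up names):

```go
package cassandraexporter // hypothetical package, for illustration only

import (
	"context"
	"fmt"

	"go.opentelemetry.io/collector/component"
)

// cassandraExporter and dialCluster are stand-ins for the real exporter's
// types; only the shape of Start matters here.
type cassandraExporter struct {
	session any
}

func (e *cassandraExporter) dialCluster(ctx context.Context) (any, error) {
	// Imagine a real driver dial here; it fails when the cluster is unreachable.
	return nil, fmt.Errorf("cluster unreachable")
}

func (e *cassandraExporter) Start(ctx context.Context, _ component.Host) error {
	session, err := e.dialCluster(ctx)
	if err != nil {
		// Today this error is treated as fatal by the service: the whole
		// collector stops, even though the backend might become reachable
		// a few seconds later.
		return fmt.Errorf("failed to connect to cassandra: %w", err)
	}
	e.session = session
	return nil
}
```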
Given that we have component status reporting now, would it make sense to ask component implementers to handle start failures via component status instead of returning an error?
Thanks for opening this issue @atoulme. I don't think that we have well-defined rules for how components should handle errors during start, but now that we have component status reporting, and an in-progress health check extension that makes use of that data, it makes sense for us to come up with guidelines for how this should work together.
This table has the definitions that I have in mind for each status:
| Status | Meaning |
|---|---|
| Starting | The component is starting. |
| OK | The component is running without issue. |
| RecoverableError | The component has experienced a transient error and may recover. |
| PermanentError | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. |
| FatalError | The collector has experienced a fatal runtime error and will shut down. |
| Stopping | The component is in the process of shutting down. |
| Stopped | The component has completed shutdown. |
Now that we have the permanent error state, I think it makes sense to ask whether or not we need fatal error any longer. The reason it was included with status reporting is that it was a preexisting error state. The intent of the permanent error state is to allow the collector to continue running in a degraded mode and alert users that there was a condition detected at runtime that will require human intervention to fix.
I would propose the following guidelines for start:
- If a component experiences a recoverable error, it should report it, but not return an error from start
- If an error is not detected by the end of start, but becomes apparent later, it should be reported as a permanent error. The same should apply to async errors that could arise from start.
- If a non-recoverable error is detected during start, a component should, in most cases, report a permanent error and allow the collector to continue to run. In an extreme case, it could return an error from start and trigger the collector to shut down. We should define what an extreme case would be; if we are unsure of any, we could remove the error return value altogether.
Users can use the health check extension and / or logs to monitor permanent and recoverable errors. If we agree that a collector that starts without an error should continue running, albeit in a possibly degraded state, we can remove fatal errors altogether and communicate these via permanent error status events instead.
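Building on the hypothetical exporter sketched earlier in this issue, a rough sketch of what the first guideline could look like (the status-reporting helpers follow the `componentstatus` package and may differ slightly across collector versions; `retryConnect` is a hypothetical background retry loop):

```go
// Sketch only: report a transient startup failure as recoverable and keep
// the collector running, instead of returning the error from Start.
func (e *cassandraExporter) Start(ctx context.Context, host component.Host) error {
	session, err := e.dialCluster(ctx)
	if err != nil {
		// A connection failure at startup may recover on its own, so report
		// it as recoverable rather than failing the whole collector.
		componentstatus.ReportStatus(host, componentstatus.NewRecoverableErrorEvent(err))
		go e.retryConnect(host) // hypothetical retry loop that reports StatusOK on success
		return nil
	}
	e.session = session
	return nil
}
```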
In the examples you cite:
> A cassandra exporter starts, but cannot connect to Cassandra right away. It returns an error.
This should be reported as a recoverable error, and recovery should be monitored / assessed via the health check extension. The extension allows a user to set a recovery duration before a recoverable error is considered unhealthy.
> An aws component starts, but cannot create an aws client, because environment variables are missing. It returns an error.
Presumably this could be discovered before start as a part of config validation. However, if there are complications that make that impossible, it should report a permanent error and let the collector run in a degraded mode. The permanent error can be monitored via the health check extension and / or logs.
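For the AWS case, a minimal sketch of moving the check into config validation (hypothetical config type and field names; the collector's convention of calling `Validate()` on configs runs before any component is started):

```go
package awsexporter // hypothetical package, for illustration only

import (
	"errors"
	"os"
)

// Config is a hypothetical exporter config; Validate follows the collector's
// convention of validating configs before components are built and started.
type Config struct {
	Region string `mapstructure:"region"`
}

func (c *Config) Validate() error {
	// Catch a missing region here instead of failing later in Start.
	if c.Region == "" && os.Getenv("AWS_REGION") == "" {
		return errors.New("aws region must be set via config or the AWS_REGION environment variable")
	}
	return nil
}
```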
I'll propose an alternative to the guidelines I previously mentioned. The previous proposal is still on the table, as well as any other alternatives anyone wants to propose. For this alternative, I suggest that we remove the error return value from start and that we remove `FatalError`s altogether. Components will use the status API to report `RecoverableError`s or `PermanentError`s during startup. The graph will automatically report `StatusOK` in the absence of the component reporting an explicit status.

This would eliminate the confusion between a `PermanentError` and `FatalError` and promote the collector running in a degraded mode rather than shutting down. We can surface failures through `StatusWatcher`s, such as the healthcheck extension, and we could consider enhancing status reporting to automatically log `PermanentError`s and possibly `RecoverableError`s (although recoverables will be noisier). If we like these guidelines for `component.Start`, we can adopt the same for `component.Shutdown`.
Regardless of the `Start` interface we decide on, I think we 100% should be logging any of the error statuses.
I like that your alternative approach keeps things unambiguous. Right now we also have no ambiguity - if a component returns an error during `Start` for any reason, the whole collector stops. The `cassandra` and `aws component` examples are totally within their rights to return errors and stop the Collector today.

As you mentioned in your guideline proposal, if we try to write rules for when a component should report an error there are always going to be edge cases. We're also leaving the state of the collector up to component authors, hoping they use component status correctly. If a component incorrectly reports `FatalError` instead of `PermanentError`, our goal of letting the collector start up in a degraded state, where components with `StatusOK` are allowed to keep running, is jeopardized.
With your alternative approach a misused status no longer causes the whole collector to stop. It also is unambiguous about what the collector does when a component has an error on startup - it keeps running.
I think my one fear with the alternative proposal is that it would allow a collector to start with all components experiencing `StatusPermanentError`, resulting in 0 components in `StatusOK` state. So there'd be an executable running, but nothing happening except the logging/status reporting that the error occurred.

Maybe that's ok? It feels like it will lead to confusion with end users. Is it possible to detect this situation and stop the collector?

Or maybe it is better to leave `Start` returning an `error`, and add options to the collector that allow ignoring those errors on startup.
> I think my one fear with the alternative proposal is that it would allow a collector to start with all components experiencing `StatusPermanentError`, resulting in 0 components in `StatusOK` state. So there'd be an executable running, but nothing happening except the logging/status reporting that the error occurred.
This situation is one of the downsides to allowing a collector to run in a degraded state. While not impossible, having all components in a permanent error state should be an edge case.
> Is it possible to detect this situation and stop the collector?
The collector itself doesn't keep or aggregate component status. The (new) health check extension does and can surface this information, although it wouldn't shut the collector down.
In my mind the blockers for this idea are then:
- Are we ok with that edge case?
- Should the behavior be configurable - should the user be able to switch between the current state (shutdown when experiencing an error on startup) and the proposed state (allow the collector to run in a degraded state when erroring on startup)?
If we allow configuring the capability then I think there is no problem with the edge case.
My opinion is that we should allow it to be configurable.
If the behavior is configurable, I think there are a couple of solutions:

- Leave `error` as a return value on the interface. Components would be expected to report a status and return an error on `Start` when appropriate. `service` would ignore the errors when configured to run in a degraded state.
- Remove `error` from `Start` and create a new function on the interface, `MustStart(ctx context.Context, host Host) error`. When configured to shut down on error, `service` will call `MustStart`, and when allowed to run in a degraded state, `service` would call `Start` (see the sketch below).
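To make the second option concrete, here is a rough sketch of the split interface. `MustStart` is purely hypothetical and exists nowhere today; this only spells out the idea:

```go
// Purely illustrative: neither this split nor MustStart exists in the
// collector; it only sketches option 2 above.
type Component interface {
	// Start never returns an error; failures are reported via status events
	// and the collector continues in a degraded state.
	Start(ctx context.Context, host Host)
	// MustStart returns an error when the component cannot start; service
	// calls this variant when configured to fail fast on startup errors.
	MustStart(ctx context.Context, host Host) error
	// Shutdown stops the component.
	Shutdown(ctx context.Context) error
}
```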
I'm ultimately ok with allowing `Start` to have a fail fast mechanism. My suggestion to consider removing it was to simplify the distinction between a `PermanentError` and `FatalError`, and error reporting during `Start` generally. Making it configurable only complicates the matter, so I would much rather we pick one.

As things currently stand, the primary way to fail fast is returning an error from `Start`; however, because some `Start` methods spawn asynchronous work, there is the ReportFatalError API as a second way to fail fast. This is for errors that aren't known at the time `Start` returns, but occur as a result of work started in `Start`. If we want to allow fail fast from `Start`, we should retain the `FatalError` status. The distinction is that `FatalError` should only be used from `Start` or async work spawned in `Start`. During runtime, components should use the `PermanentError` status.
To sum up this secondary proposal:

- Change nothing about how `Start` currently works
- Retain `FatalError`, but clarify its distinction from `PermanentError`
After thinking about this a little more, I think we can retain the behavior of the previous proposal and simplify how errors are handled during start. I suggest that we remove the error return value from `Start` and rely on component status reporting to handle errors during startup. A component can report `StatusFatalError` to halt startup in the same way that returning an error does today. This will simplify things by giving us one way to report and handle errors during startup.
To clarify, this proposal suggests we:

- Remove the error return value from `Start`
- Components wishing to halt startup and stop the collector can report `StatusFatalError`
- Clarify that `StatusFatalError` should only be reported from `Start` or async work initiated during `Start` as a fail fast mechanism
Reading through this issue and https://github.com/open-telemetry/opentelemetry-collector/issues/9823, I have an alternative proposal to consider:

- Keep `error` on `Start`. If an error is returned from `Start` it means the collector should shut down.
  - This is the Collector's current behavior.
- Remove `StatusFatalError`.
  - If a component experiences an error on start that should stop the collector, it should return it. If the error should not stop the collector, the component should report `StatusPermanentError` instead.
- After each `Start()` has been called and returned, if all receivers are reporting `StatusPermanentError`, shut down.
  - Maybe this doesn't work because of async work within start that might not have set `StatusPermanentError`?
This approach has the benefits of removing any ambiguity between `Permanent` and `Fatal` status and doesn't require a breaking change to all components. Components that are erroring when they should be reporting `StatusPermanentError` would need updating, but the components would still be functional (no breaking change) until that happens.

I think this proposal has a downside, and that's how to stop the collector when async work from `Start` fails. I'd like to better understand this use case; are there any components you can share as examples? I am hoping the async work could report `PermanentError` instead of `Fatal`, OR if it is actually critical for startup, be made synchronous instead.
There are two issues with this proposal. One is that `FatalError`s can happen as a result of `Start`, but not necessarily by the time `Start` returns. This is discussed a little more here: https://github.com/open-telemetry/opentelemetry-collector/issues/9823#issuecomment-2091906055. Here is a real world example in the zipkin receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/33b15734edb65cbd861832030790768883f8ede4/receiver/zipkinreceiver/trace_receiver.go#L89-L115.
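The shape of that pattern, heavily simplified, looks something like this (hypothetical receiver type and fields; the status helper names follow the current `componentstatus` package, while the linked code predates it and used the older reporting API):

```go
package zipkinlikereceiver // hypothetical package, for illustration only

import (
	"context"
	"errors"
	"net"
	"net/http"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/component/componentstatus"
)

type zipkinLikeReceiver struct {
	endpoint string
	handler  http.Handler
	server   *http.Server
}

func (r *zipkinLikeReceiver) Start(_ context.Context, host component.Host) error {
	ln, err := net.Listen("tcp", r.endpoint)
	if err != nil {
		return err // synchronous failure: known before Start returns
	}
	r.server = &http.Server{Handler: r.handler}
	go func() {
		if err := r.server.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) {
			// Asynchronous failure: Start already returned nil, so the only
			// fail-fast path left is a FatalError status event (the
			// replacement for the old host.ReportFatalError call).
			componentstatus.ReportStatus(host, componentstatus.NewFatalErrorEvent(err))
		}
	}()
	return nil
}
```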
The other issue is that the collector does not maintain or aggregate status for components. It dispatches status events to `StatusWatcher`s, an optional interface for extensions. The extensions do whatever they see fit with the events. The new healthcheck extension is a `StatusWatcher` that aggregates events. It could detect this situation and report a `FatalError`, supposing we keep it, thereby shutting down the collector. This is a reasonable option to add to the extension, but it doesn't help us eliminate `FatalError` or simplify the return value of `Start`.
@mwear in that specific zipkin example, would it be inappropriate to report a `PermanentError` instead? It feels like at that point in the startup process all synchronous checks have passed, and what's left is to start up the server. If the server doesn't start up, is that a `PermanentError` or a `FatalError`? If other components/pipelines are able to start, it feels more like a `PermanentError` for this receiver.
> The other issue is that the collector does not maintain or aggregate status for components.
Could it? Is there a reason why the collector's Run loop couldn't read from the Status channel and react accordingly?
> If the server doesn't start up, is that a PermanentError or a FatalError?
The `FatalError` reported by the Zipkin receiver, and `FatalError`s generally, are a result of the `host.ReportFatalError` API that predates component status reporting. While introducing component status reporting, it was pointed out that we could and probably should replace `host.ReportFatalError` with the status reporting equivalent. It was ultimately deprecated, replaced, and removed. For this reason, all existing calls to `host.ReportFatalError` were replaced with ReportStatus with a FatalError event, see: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30501. As far as I know, this API was introduced to allow the collector to fail fast from async work initiated in `component.Start`. Synchronous failures have always used an error return value. The tl;dr is that nothing has fundamentally changed RE `FatalError`s since the introduction of status reporting. What has changed is the API called to initiate the fail fast behavior. Changing this to be a `PermanentError` would be a change in behavior we could consider, but I think `FatalError` still has a place for reasons I'll describe.
> Could it? Is there a reason why the collector's Run loop couldn't read from the Status channel and react accordingly?
It could, but I don't think that it should. This goes against the design of component status reporting. The system was designed with the idea that components report their status via events, the collector automates this where possible, and it dispatches the events to extensions (implementers of the StatusWatcher interface). The extensions can process events as they see fit. The collector provides the machinery to pass events between components and extensions. It makes no attempt to interpret the events. That responsibility is left to extensions.
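For context, the watcher side is a small optional interface, roughly as follows. The exact package and type names have moved between the `extension` and `componentstatus` packages across collector versions, so treat this as a sketch:

```go
// Sketch of the watcher contract: extensions opt in by implementing it, the
// service pushes every status event to them, and the collector itself never
// aggregates or interprets the events.
type StatusWatcher interface {
	ComponentStatusChanged(source *componentstatus.InstanceID, event *componentstatus.Event)
}
```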
As a concrete example, we can briefly mention the new health check extension. It has an event aggregation system in order to answer questions about pipeline health and overall collector health; however, this system is specific to the extension and would not likely be something adopted into the core collector. The aggregation itself varies based on user configuration. There is a preliminary set of knobs to tune aggregation depending on what the user wants to consider healthy or not, and there will likely be many more to follow. By default both Permanent and Recoverable errors are considered healthy. Users have to opt in to consider either, or both, to be unhealthy. Recoverables have the additional feature that you can specify a recovery interval during which they should recover. They will be considered healthy until the interval has elapsed and unhealthy afterwards. While the health check extension is an early example of a `StatusWatcher`, I expect more will be added over time. I believe we will eventually have an extension to export the raw events (as OTLP logs) to backends that will process them according to their own rules.

Getting back on topic, component status reporting was not designed to determine what is healthy, only to facilitate the reporting of events between components and extensions; the extensions handle the events as they choose. Moving aggregation into the collector, or having a parallel system, is not really in line with the original design, and I don't think there is a good reason to do this if we keep `FatalError` as a fail-fast mechanism.
With https://github.com/open-telemetry/opentelemetry-collector/pull/10413 merged, the decision to keep `Start` returning an error is solidified.