opentelemetry-collector icon indicating copy to clipboard operation
opentelemetry-collector copied to clipboard

Define guideline for components to deprecate configuration

Open codeboten opened this issue 2 years ago • 4 comments

The following change in contrib caused a breaking change for a component identified as stable. This would cause existing users of the component to break when upgrading their collector if they were using the configuration options removed.

Currently the policy for breaking changes is defined for changes to the API, but it's unclear how components should apply this policy for configuration changes.

This issue is a discussion to determine:

  1. what the guidelines should be for configuration changes/deprecation
  2. how components can be marked stable and if they even can be stable while the collector is still in beta
  3. if there should be separate versioning for individual components. Currently all components within a release have the same version. Separating component versions could allow components to change major versions and follow semver, although it's still unclear if this would be enough considering the overall version of the collector wouldn't change.

Pinging the folks that were involved in a discussion in a private channel to add their comments here @Aneurysm9 @jpkrohling @dashpole @mx-psi

Note: I understand the issue was caused by a change in the contrib repo, but since the stability definition is in this repo, I figured it made more sense to open the issue here.

codeboten avatar Jul 14 '22 14:07 codeboten

Here's a first idea to debate, although I think it might go against @Aneurysm9's idea.

  1. we will do our best to not add new required options, nor remove existing options, from components, but we recognize that we might need to, either because the components we interact with have changed, or because we evolved
  2. if we add something, we'll do our best to set a reasonable default. We will also consider not enabling features by default that might cause considerable performance impact to the application (memory, cpu, pipeline latency). I know this is too abstract, but what "considerable" means might differ from component to component.
  3. if we are replacing a property, we'll do it in at least two steps: first, we'll add the new property and convert the old one into the new one, logging an INFO-level message. On at least N-2 to the version we intend to remove this property, we increase the level of the message to WARN if we see the old one is used. On at least N-1 of the version we intend to remove, we log an ERROR if we see the old one is used.
  4. if we are removing property without replacement, we should follow the same as above regarding the log messages.

jpkrohling avatar Jul 14 '22 17:07 jpkrohling

Small note: The PR above didn't make any breaking changes. It just added warning messages when using the "old" fields. If we want to impose additional requirements for removal, we still can.

My personal opinion is that logs are not very useful for announcing deprecations. I don't generally look at logs unless I'm debugging a particular problem, at which point I am usually not looking at the logs printed at startup. The better pattern is to fail on startup with an easy way to revert to the previous behavior. I would prefer either using feature gates for all breaking changes, or introducing semvar for components with the ability for users to choose a major version.

dashpole avatar Jul 14 '22 18:07 dashpole

Getting to a place where we can use semvar is the ultimate solution, but does it requires Core to have an official 1.0.0 release? It would feel weird to say a component is at version 1.0.0 while completely depending on the Core/Contrib libraries, which would be at 0.55.0 and could cause breaking changes to the component at any moment.

TylerHelmuth avatar Jul 14 '22 21:07 TylerHelmuth

what the guidelines should be for configuration changes/deprecation

A combination of @jpkrohling's and @dashpole's points seems reasonable here. Some people do check logs, so we want to have logs for this set of people before we switch to failing on startup. To get to a compromise where we all feel okay with the guidelines, we should maybe have a longer minimum period of deprecation (I think @Aneurysm9 suggested 6 months, maybe that's okay for 'stable' components).

I would also like to suggest requiring a tracking issue for breaking changes, that has migration instructions and is referenced in logs, errors and feature gate descriptions. I tried to do this for the Datadog exporter (e.g. see open-telemetry/opentelemetry-collector-contrib/issues/8845), we can take that as a starting point.

We should also clarify to what stability level these rules apply (they probably shouldn't apply to alpha or in development, they may apply to beta with a smaller deprecation period).

how components can be marked stable and if they even can be stable while the collector is still in beta

Should we use a different wording instead of stable? Maybe 'generally available' or 'ready for production' are acceptable terms. IMO it's not reasonable to have a stability level that says 'no breaking changes ever' at this point, so if it's confusing we can just change the name.

if there should be separate versioning for individual components

My opinion is that we must not make 1.0 releases of components for now. The Go API is nowhere near ready to be 1.0, and we don't want to convey that we follow semver when we don't. We need to have a Core 1.0 for that, and then carefully review the Go API of each component to ensure we don't expose anything we don't want to.

If we don't have 1.0, I don't think having a different versioning schema for stable components makes a lot of sense, so I would remove versioning from the discussion.

mx-psi avatar Jul 15 '22 11:07 mx-psi

As mentioned on open-telemetry/opentelemetry-collector-contrib/issues/4055, our current versioning guidelines require that stable components on both core and contrib have a stable public Go API, I think we should mention this explicitly as a requirement for the 'stable' stability level

mx-psi avatar Jan 23 '23 09:01 mx-psi

@mx-psi, @codeboten, @TylerHelmuth, if you haven't started working on this, I could work on a PR based on the current state of this discussion.

@Aneurysm9, @dashpole, could you please have another look at this issue and let me know if you have concerns not addressed yet?

jpkrohling avatar Jul 05 '23 18:07 jpkrohling

My request is that if/when breaking changes to configuration or behavior take effect, users have an easy way to revert to the previous config or behavior for a period of time. That is what using a feature gate provides, since beta feature gates can still be disabled by users. I would add to the Stable component definition:

When making breaking changes to configuration or behavior, the change MUST be gated by a feature gate, and MUST spend at least 6 months in beta.

dashpole avatar Jul 05 '23 19:07 dashpole

if you haven't started working on this, I could work on a PR based on the current state of this discussion.

Sounds good to me, I would be interested in reviewing :)

mx-psi avatar Jul 06 '23 09:07 mx-psi

I added a first draft for this. I took both the current stability levels for collector components and the maturity levels OTEP to write this one. I think it reflects what was discussed in this issue as well.

jpkrohling avatar Aug 25 '23 10:08 jpkrohling