api icon indicating copy to clipboard operation
api copied to clipboard

MON-4359: Add Monitoring Capability

Open rexagod opened this issue 5 months ago • 13 comments

Add a "Monitoring" capability to support optional monitoring.

rexagod avatar Sep 03 '25 22:09 rexagod

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci[bot] avatar Sep 03 '25 22:09 openshift-ci[bot]

Hello @rexagod! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci[bot] avatar Sep 03 '25 22:09 openshift-ci[bot]

@rexagod: This pull request references MON-4359 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

In response to this:

Add a "Monitoring" capability to support optional monitoring.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Sep 03 '25 23:09 openshift-ci-robot

/assign

everettraven avatar Sep 29 '25 13:09 everettraven

@everettraven Apologies! It seems there was some development before I got a chance to update this after marking it ready.

I'll make changes to implicitly enable the capability for TechPreview only in 4.21.

rexagod avatar Oct 12 '25 20:10 rexagod

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from everettraven. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Oct 12 '25 22:10 openshift-ci[bot]

@everettraven I'm probably missing something but I expected to find something similar to the feature-gates workflow (annotations/tags and such) to enable this just for TechPreview. Could you please drop some pointers on how I can achieve that? Thanks!

rexagod avatar Oct 12 '25 22:10 rexagod

@rexagod: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Oct 12 '25 22:10 openshift-ci[bot]

@rexagod As far as I am aware, the process for tech preview capabilities is the same as adding any new enum value under TechPreviewNoUpgrade, meaning:

  • An OpenShift Enhancement Proposal
  • A new feature gate
  • Adding the new enum using a feature-gate aware enum marker

I'm relatively new to capabilities for OpenShift so I read up on the original intention of them in https://github.com/openshift/enhancements/blob/df1fada1e045b2e81b677941414f3085c02945fd/enhancements/installer/component-selection.md

With my current understanding of how capabilities work it should be an all-or-nothing type configuration. It seems like an anti-pattern to have a partial capability.

I'd like to see answered in the EP you write up:

  • Why your cluster operator cannot be responsible for removing optional monitoring configurations
  • What is and is not included in this capability and why it is okay to exclude some monitoring components and not others
  • What other components may have a dependency on these now optional components

Thanks!

Edit to clarify: Re-reading the description of the capability, I'm not sure a capability is what you are actually looking for here. This seems more like an option that should be configured on and managed by the CMO itself.

everettraven avatar Oct 13 '25 14:10 everettraven

👋🏼 I'm a bit confused about the EP and feature-gate part, since https://github.com/openshift/enhancements/blob/df1fada1e045b2e81b677941414f3085c02945fd/enhancements/installer/component-selection.md#how-to-implement-a-new-capability doesn't mention those?

What is and is not included in this capability and why it is okay to exclude some monitoring components and not others

I've tried to answer this in the godoc for the capability but essentially this allows us to tune down the resource consumption of the in-cluster monitoring stack by not deploying optional components.

Why your cluster operator cannot be responsible for removing optional monitoring configurations Re-reading the description of the capability, I'm not sure a capability is what you are actually looking for here. This seems more like an option that should be configured on and managed by the CMO itself.

We chose to go with capabilities to support configuring these at install time for clusters that are only concerned with forwarding telemetry metrics.

What other components may have a dependency on these now optional components

The optional components in question are Alertmanager and all UserWorkloadMonitoring components managed by CMO. The latter is opt-out by default, so no dependencies there. For AM, as I mentioned earlier, this could impact any areas that expect an AM presence 100% of the time (so far, monitoring-plugin, which we'll fix down the line to hide the "Alerts" tab when AM is absent).

rexagod avatar Oct 13 '25 16:10 rexagod

I'm a bit confused about the EP and feature-gate part, since https://github.com/openshift/enhancements/blob/df1fada1e045b2e81b677941414f3085c02945fd/enhancements/installer/component-selection.md#how-to-implement-a-new-capability doesn't mention those?

This document looks like it hasn't been updated in the last couple years. Our processes have changed throughout this time such that any API change, especially to a v1 API, must go behind a feature gate in TechPreviewNoUpgrade. Creating a feature gate requires an OpenShift enhancement proposal. This is meant to ensure that we:

  • Don't inadvertently ship something as GA without a backing implementation
  • Don't GA something without appropriate regression testing

This is all meant to ensure we are maintaining a quality bar for changes that go into the product.

I've tried to answer this in the godoc for the capability but essentially this allows us to tune down the resource consumption of the in-cluster monitoring stack by not deploying optional components.

I appreciate you taking the time to answer some of this in the GoDoc. I'm still not quite sure I have an understanding from reading that why some pieces of the monitoring stack we deploy by default is considered optional and why others are not.

We chose to go with capabilities to support configuring these at install time for clusters that are only concerned with forwarding telemetry metrics.

My understanding of capabilities is that they are all-or-nothing type mechanisms to tell the ClusterVersionOperator whether or not to stamp $thing out onto the cluster, not a way to tell cluster operators what they should and should not be doing. I don't think this partial use case is one that the "capability" concept was intended to support.

I'm not sure off the top of my head if there is another way to achieve what you are looking for at install time, but maybe @JoelSpeed would be more familiar with the options you have here.

I've got a few questions here that having an EP would be helpful to field these discussion:

  • Why does a user need to configure this at install time?
  • What is the impact to an end user if they can only configure this option as a "day 2" operation?
  • Should users be able to enable/disable the optional pieces of the monitoring stack at will?

everettraven avatar Oct 14 '25 12:10 everettraven

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Jan 14 '26 09:01 openshift-bot

[!IMPORTANT]

Review skipped

Auto reviews are limited based on label configuration.

:no_entry_sign: Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Jan 14 '26 09:01 coderabbitai[bot]