configuration-anomaly-detection SREP-1881: Automated must-gathers ROSA classic implementation

What type of PR is this?

feature

What this PR does / Why we need it?

This PR implements automated must-gather collection and upload for ROSA classic clusters, reducing manual SRE effort when diagnosing cluster issues. It covers only ROSA classic for now; HCP support depends on the changes tracked in https://issues.redhat.com/browse/SREP-2330

Key Features:

- Automatically collects OpenShift diagnostics via oc adm must-gather when triggered by PagerDuty "CreateMustGather" alert
- Compresses diagnostics into a tarball and uploads to Red Hat SFTP server using anonymous temporary credentials
- Posts upload location to PagerDuty incident notes for easy SRE access
- Tracks success metrics via Prometheus (cad_investigate_must_gather_performed_total)
- Escalates to primary on any failure for immediate attention (likely to change in future)

Why we need it: Currently, SREs must manually:

- Connect to cluster via backplane
- Run oc adm must-gather
- Download and upload diagnostics to SFTP
- Share location with team

This automation eliminates these manual steps, providing immediate diagnostic data access and reducing time-to-resolution for cluster issues.

Special notes for your reviewer

Note that when testing this locally with the local backplane setup, the metadata.yaml file must be committed to the main branch:

  git fetch origin SREP-1881-automated-must-gathers-investigation && \
  git checkout main && \
  git checkout origin/SREP-1881-automated-must-gathers-investigation -- pkg/investigations/mustgather/metadata.yaml && \
  git add pkg/investigations/mustgather/metadata.yaml && \
  git commit -m "Add mustgather metadata.yaml for local testing" && \
  git checkout -

SFTP Security:
- Uses temporary anonymous credentials from Red Hat SFTP API (time-limited)
- Validates server SSH fingerprint: SHA256:Ij7dPhl1PhiycLC/rFXy1sGO2nSS9ky0PYdYhi+ykpQ
- SFTP upload instructions are publicly documented at https://access.redhat.com/articles/5594481
Performance:
- Typical 50MB must-gather: ~7 minutes end-to-end (5 minutes for SFTP upload at ~10 MB/min)
- 6-hour timeout configured for large files
- Context-aware I/O allows graceful cancellation
Metrics Implementation:
- Only records success (failures derived from alerts_total - must_gather_performed_total)
- Label: "ROSA classic" to identify cluster type
- Grafana Dashboard: Panel for must-gather metrics will be added later once real data points are available for proper visualization tuning
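
The failure derivation above is a plain subtraction of two counters. A small sketch, assuming both values are read at the same point in time; `derivedFailures` is a hypothetical name, and the clamp to zero is an added safeguard for counter resets or scrape skew, not something this PR specifies.

```go
package main

import "fmt"

// derivedFailures returns alerts minus successful must-gathers, clamped at
// zero so a counter reset or scrape skew cannot produce an underflow.
func derivedFailures(alertsTotal, performedTotal uint64) uint64 {
	if performedTotal > alertsTotal {
		return 0
	}
	return alertsTotal - performedTotal
}

func main() {
	fmt.Println(derivedFailures(12, 9)) // 3
}
```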
Future Enhancements (documented in README):
- Retry logic for transient failures
- Threshold-based alerting instead of immediate escalation
- HCP/Hypershift support (pending backplane support)

Test Coverage

Guidelines for CAD investigations

New investigations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.

Test coverage checks

[x] Added tests
[ ] Created jira card to add unit test
[ ] This PR may not need unit tests

Pre-checks (if applicable)

[x] Ran unit tests locally
[x] Validated the changes in a cluster
[x] Included documentation changes with PR

Nov 28 '25 13:11 rolandmkunkel

@rolandmkunkel: This pull request references SREP-1881 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Nov 28 '25 13:11 openshift-ci-robot

Codecov Report

:x: Patch coverage is 33.15789% with 127 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 35.37%. Comparing base (6fe1328) to head (3c3224c). :warning: Report is 6 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/investigations/mustgather/sftpUpload.go	39.36%	53 Missing and 4 partials :warning:
pkg/investigations/mustgather/mustgather.go	0.00%	51 Missing :warning:
pkg/investigations/utils/tarball/tarball.go	61.90%	8 Missing and 8 partials :warning:
cadctl/cmd/investigate/investigate.go	0.00%	2 Missing :warning:
pkg/metrics/metrics.go	0.00%	1 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #628      +/-   ##
==========================================
- Coverage   35.56%   35.37%   -0.19%     
==========================================
  Files          43       46       +3     
  Lines        2829     3022     +193     
==========================================
+ Hits         1006     1069      +63     
- Misses       1745     1863     +118     
- Partials       78       90      +12

Files with missing lines	Coverage Δ
pkg/investigations/investigation/investigation.go	`15.83% <ø> (ø)`
pkg/investigations/registry.go	`0.00% <ø> (ø)`
pkg/ocm/ocm.go	`0.00% <ø> (ø)`
pkg/metrics/metrics.go	`0.00% <0.00%> (ø)`
cadctl/cmd/investigate/investigate.go	`0.00% <0.00%> (ø)`
pkg/investigations/utils/tarball/tarball.go	`61.90% <61.90%> (ø)`
pkg/investigations/mustgather/mustgather.go	`0.00% <0.00%> (ø)`
pkg/investigations/mustgather/sftpUpload.go	`39.36% <39.36%> (ø)`

... and 1 file with indirect coverage changes

Nov 28 '25 13:11 codecov-commenter

Thanks for the review @bergmannf I've updated the PR

Dec 02 '25 15:12 rolandmkunkel

@rolandmkunkel: all tests passed!

Dec 02 '25 15:12 openshift-ci[bot]

/lgtm

Dec 05 '25 14:12 bergmannf

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bergmannf, rolandmkunkel

Needs approval from an approver in each of these files:

~~OWNERS~~ [bergmannf,rolandmkunkel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Dec 05 '25 14:12 openshift-ci[bot]