SREP-1881: Automated must-gathers ROSA classic implementation
What type of PR is this?
feature
What this PR does / Why we need it?
This PR implements automated must-gather collection and upload for ROSA classic clusters, reducing manual SRE effort when diagnosing cluster issues. It covers only ROSA classic for now, because HCP support depends on the changes tracked in https://issues.redhat.com/browse/SREP-2330
Key Features:
- Automatically collects OpenShift diagnostics via oc adm must-gather when triggered by PagerDuty "CreateMustGather" alert
- Compresses diagnostics into a tarball and uploads to Red Hat SFTP server using anonymous temporary credentials
- Posts upload location to PagerDuty incident notes for easy SRE access
- Tracks success metrics via Prometheus (cad_investigate_must_gather_performed_total)
- Escalates to primary on any failure for immediate attention (likely to change in future)
Why we need it: Currently, SREs must manually:
- Connect to cluster via backplane
- Run oc adm must-gather
- Download and upload diagnostics to SFTP
- Share location with team
This automation eliminates these manual steps, providing immediate diagnostic data access and reducing time-to-resolution for cluster issues.
Special notes for your reviewer
Note that when testing this locally with the local backplane setup, the metadata.yaml file must be committed to the main branch:
git fetch origin SREP-1881-automated-must-gathers-investigation && \
git checkout main && \
git checkout origin/SREP-1881-automated-must-gathers-investigation -- pkg/investigations/mustgather/metadata.yaml && \
git add pkg/investigations/mustgather/metadata.yaml && \
git commit -m "Add mustgather metadata.yaml for local testing" && \
git checkout -
- SFTP Security:
- Uses temporary anonymous credentials from Red Hat SFTP API (time-limited)
- Validates server SSH fingerprint: SHA256:Ij7dPhl1PhiycLC/rFXy1sGO2nSS9ky0PYdYhi+ykpQ
- SFTP upload instructions are publicly documented at https://access.redhat.com/articles/5594481
- Performance:
- Typical 50MB must-gather: ~7 minutes end-to-end (5 minutes for SFTP upload at ~10 MB/min)
- 6-hour timeout configured for large files
- Context-aware I/O allows graceful cancellation
- Metrics Implementation:
- Only records success (failures derived from alerts_total - must_gather_performed_total)
- Label: "ROSA classic" to identify cluster type
- Grafana Dashboard: Panel for must-gather metrics will be added later once real data points are available for proper visualization tuning
- Future Enhancements (documented in README):
- Retry logic for transient failures
- Threshold-based alerting instead of immediate escalation
- HCP/Hypershift support (pending backplane support)
Test Coverage
Guidelines for CAD investigations
- New investigations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
- Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.
Test coverage checks
- [x] Added tests
- [ ] Created jira card to add unit test
- [ ] This PR may not need unit tests
Pre-checks (if applicable)
- [x] Ran unit tests locally
- [x] Validated the changes in a cluster
- [x] Included documentation changes with PR
@rolandmkunkel: This pull request references SREP-1881 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
Codecov Report
:x: Patch coverage is 33.15789% with 127 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 35.37%. Comparing base (6fe1328) to head (3c3224c).
:warning: Report is 6 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #628 +/- ##
==========================================
- Coverage 35.56% 35.37% -0.19%
==========================================
Files 43 46 +3
Lines 2829 3022 +193
==========================================
+ Hits 1006 1069 +63
- Misses 1745 1863 +118
- Partials 78 90 +12
| Files with missing lines | Coverage Δ | |
|---|---|---|
| pkg/investigations/investigation/investigation.go | 15.83% <ø> (ø) | |
| pkg/investigations/registry.go | 0.00% <ø> (ø) | |
| pkg/ocm/ocm.go | 0.00% <ø> (ø) | |
| pkg/metrics/metrics.go | 0.00% <0.00%> (ø) | |
| cadctl/cmd/investigate/investigate.go | 0.00% <0.00%> (ø) | |
| pkg/investigations/utils/tarball/tarball.go | 61.90% <61.90%> (ø) | |
| pkg/investigations/mustgather/mustgather.go | 0.00% <0.00%> (ø) | |
| pkg/investigations/mustgather/sftpUpload.go | 39.36% <39.36%> (ø) | |
Thanks for the review, @bergmannf. I've updated the PR.
@rolandmkunkel: all tests passed!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bergmannf, rolandmkunkel
- ~~OWNERS~~ [bergmannf,rolandmkunkel]