configuration-anomaly-detection

SREP-1881: Automated must-gathers ROSA classic implementation

Open rolandmkunkel opened this issue 3 months ago • 6 comments

What type of PR is this?

feature

What this PR does / Why we need it?

This PR implements automated must-gather collection and upload for ROSA classic clusters, reducing manual SRE effort when diagnosing cluster issues. Note that this PR covers only ROSA classic for now; HCP support requires the changes tracked in https://issues.redhat.com/browse/SREP-2330

Key Features:

  • Automatically collects OpenShift diagnostics via oc adm must-gather when triggered by the PagerDuty "CreateMustGather" alert
  • Compresses the diagnostics into a tarball and uploads it to the Red Hat SFTP server using anonymous temporary credentials
  • Posts the upload location to the PagerDuty incident notes for easy SRE access
  • Tracks success metrics via Prometheus (cad_investigate_must_gather_performed_total)
  • Escalates to primary on any failure for immediate attention (likely to change in the future)

Why we need it: Currently, SREs must manually:

  1. Connect to cluster via backplane
  2. Run oc adm must-gather
  3. Download and upload diagnostics to SFTP
  4. Share location with team

This automation eliminates these manual steps, providing immediate diagnostic data access and reducing time-to-resolution for cluster issues.

Special notes for your reviewer

Note that when testing this locally with the local backplane setup, the metadata.yaml file must be committed to the main branch:

  git fetch origin SREP-1881-automated-must-gathers-investigation && \
  git checkout main && \
  git checkout origin/SREP-1881-automated-must-gathers-investigation -- pkg/investigations/mustgather/metadata.yaml && \
  git add pkg/investigations/mustgather/metadata.yaml && \
  git commit -m "Add mustgather metadata.yaml for local testing" && \
  git checkout -

  1. SFTP Security:
    • Uses temporary anonymous credentials from Red Hat SFTP API (time-limited)
    • Validates server SSH fingerprint: SHA256:Ij7dPhl1PhiycLC/rFXy1sGO2nSS9ky0PYdYhi+ykpQ
    • SFTP upload instructions are publicly documented at https://access.redhat.com/articles/5594481
  2. Performance:
    • Typical 50MB must-gather: ~7 minutes end-to-end (5 minutes for SFTP upload at ~10 MB/min)
    • 6-hour timeout configured for large files
    • Context-aware I/O allows graceful cancellation
  3. Metrics Implementation:
    • Only records success (failures derived from alerts_total - must_gather_performed_total)
    • Label: "ROSA classic" to identify cluster type
    • Grafana Dashboard: Panel for must-gather metrics will be added later once real data points are available for proper visualization tuning
  4. Future Enhancements (documented in README):
    • Retry logic for transient failures
    • Threshold-based alerting instead of immediate escalation
    • HCP/Hypershift support (pending backplane support)

Test Coverage

Guidelines for CAD investigations

  • New investigations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
  • Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.

Test coverage checks

  • [x] Added tests
  • [ ] Created jira card to add unit test
  • [ ] This PR may not need unit tests

Pre-checks (if applicable)

  • [x] Ran unit tests locally
  • [x] Validated the changes in a cluster
  • [x] Included documentation changes with PR

rolandmkunkel avatar Nov 28 '25 13:11 rolandmkunkel

@rolandmkunkel: This pull request references SREP-1881 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Nov 28 '25 13:11 openshift-ci-robot

Codecov Report

:x: Patch coverage is 33.15789% with 127 lines in your changes missing coverage. Please review.

:white_check_mark: Project coverage is 35.37%. Comparing base (6fe1328) to head (3c3224c).

:warning: Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/investigations/mustgather/sftpUpload.go | 39.36% | 53 Missing and 4 partials :warning: |
| pkg/investigations/mustgather/mustgather.go | 0.00% | 51 Missing :warning: |
| pkg/investigations/utils/tarball/tarball.go | 61.90% | 8 Missing and 8 partials :warning: |
| cadctl/cmd/investigate/investigate.go | 0.00% | 2 Missing :warning: |
| pkg/metrics/metrics.go | 0.00% | 1 Missing :warning: |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #628      +/-   ##
==========================================
- Coverage   35.56%   35.37%   -0.19%     
==========================================
  Files          43       46       +3     
  Lines        2829     3022     +193     
==========================================
+ Hits         1006     1069      +63     
- Misses       1745     1863     +118     
- Partials       78       90      +12     

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/investigations/investigation/investigation.go | 15.83% <ø> (ø) |
| pkg/investigations/registry.go | 0.00% <ø> (ø) |
| pkg/ocm/ocm.go | 0.00% <ø> (ø) |
| pkg/metrics/metrics.go | 0.00% <0.00%> (ø) |
| cadctl/cmd/investigate/investigate.go | 0.00% <0.00%> (ø) |
| pkg/investigations/utils/tarball/tarball.go | 61.90% <61.90%> (ø) |
| pkg/investigations/mustgather/mustgather.go | 0.00% <0.00%> (ø) |
| pkg/investigations/mustgather/sftpUpload.go | 39.36% <39.36%> (ø) |

... and 1 file with indirect coverage changes

codecov-commenter avatar Nov 28 '25 13:11 codecov-commenter

Thanks for the review @bergmannf I've updated the PR

rolandmkunkel avatar Dec 02 '25 15:12 rolandmkunkel

@rolandmkunkel: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Dec 02 '25 15:12 openshift-ci[bot]

/lgtm

bergmannf avatar Dec 05 '25 14:12 bergmannf

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bergmannf, rolandmkunkel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [bergmannf,rolandmkunkel]

Approvers can indicate their approval by writing /approve in a comment.

Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Dec 05 '25 14:12 openshift-ci[bot]