opensearch-build
Monitor Critical GitHub Actions Workflows Across Organization Repositories
Is your feature request related to a problem? Please describe
Background
We recently had a situation where the publish-snapshots-to-Maven GitHub Actions workflow started failing across all the repositories due to an issue on the Sonatype Central side. They had accidentally deleted user tokens during maintenance, and our jobs started failing with 401 errors.
An operator happened to check the failed workflow on the commit they merged and saw the snapshot workflow failure; upon further investigation it was found that the same workflow had been failing across all the repositories with the same error for the past 24 hours.
We need to implement a system to monitor critical GitHub Actions workflows across multiple repositories in our organization. This will help us quickly identify and respond to workflow failures or issues.
Describe the solution you'd like
Proposed Solutions
We have identified two broad categories of approaches: pull-based and push-based monitoring.
1. Pull-based Monitoring
Description
a) Onboard GitHub Actions workflow metrics onto the existing metrics framework (Recommended)
- Use the existing metrics framework to onboard GitHub Actions metrics
- Add a monitor on the failure metric and notify in the Slack channel (already implemented)
b) Use GitHub REST APIs to periodically fetch the GitHub Actions status (a minimal sketch of this approach follows the Challenges below)
- Index the collected data in an OpenSearch cluster
- Implement a pull job that runs on a cron schedule in Jenkins
Advantages
- Centralized monitoring solution
- Can provide historical data and trends
- Allows for custom alerting based on various criteria
Challenges
- May have a slight delay in detecting issues due to the polling interval
- Need to manage API rate limits
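For illustration, here is a minimal sketch of option 1(b): a Jenkins cron job could poll the GitHub REST API and index each workflow run into OpenSearch. The repository list, the index name gha-workflow-runs, and the environment variables below are assumptions, not an existing implementation:

```python
# Minimal pull-job sketch: fetch recent workflow runs per repo and index them.
# GITHUB_TOKEN, OPENSEARCH_URL, and the index name "gha-workflow-runs" are
# placeholders; pagination, auth to OpenSearch, and retries are omitted.
import os
import requests

GITHUB_API = "https://api.github.com"
ORG = "opensearch-project"
REPOS = ["opensearch-build", "opensearch"]  # example subset of repos to watch

def fetch_workflow_runs(repo: str, token: str) -> list[dict]:
    """Fetch the most recent workflow runs for a repository."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{ORG}/{repo}/actions/runs",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        params={"per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("workflow_runs", [])

def index_run(opensearch_url: str, repo: str, run: dict) -> None:
    """Index one workflow run document, keyed by the run id."""
    doc = {
        "repository": repo,
        "name": run["name"],
        "status": run["status"],
        "conclusion": run["conclusion"],
        "created_at": run["created_at"],
        "html_url": run["html_url"],
    }
    requests.put(f"{opensearch_url}/gha-workflow-runs/_doc/{run['id']}",
                 json=doc, timeout=30).raise_for_status()

if __name__ == "__main__":
    token = os.environ["GITHUB_TOKEN"]
    opensearch_url = os.environ["OPENSEARCH_URL"]
    for repo in REPOS:
        for run in fetch_workflow_runs(repo, token):
            index_run(opensearch_url, repo, run)
```

Running such a job on a short cron interval (for example every 15-30 minutes) keeps the detection delay bounded while keeping the API call volume manageable against rate limits.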
2. Push-based Monitoring
Description
a) Slack Notifications Integration in Workflows
- Add a Slack action to critical workflows
- Configure the action to send a Slack message notification when a job fails
b) Email Notifications
- Use GitHub's built-in email notification system or a custom email action
- Send detailed email reports for workflow failures
c) Webhook Integration
- Set up a custom webhook endpoint in our infrastructure
- Configure GitHub to send workflow status updates to this endpoint
- Process incoming webhooks to trigger appropriate actions (e.g., update a status page, send notifications)
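For illustration, a minimal sketch of the webhook option, assuming an organization webhook is configured to deliver workflow_run events to this endpoint and SLACK_WEBHOOK_URL points at a Slack incoming webhook (both are placeholders):

```python
# Minimal webhook receiver sketch for GitHub "workflow_run" events.
# The endpoint, port, and SLACK_WEBHOOK_URL are illustrative. A production
# endpoint should also verify the X-Hub-Signature-256 header.
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

class WorkflowEventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = json.loads(body or "{}")
        run = event.get("workflow_run", {})
        # Only act on completed runs that did not succeed.
        if event.get("action") == "completed" and run.get("conclusion") != "success":
            message = (f"Workflow '{run.get('name')}' failed in "
                       f"{event.get('repository', {}).get('full_name')}: "
                       f"{run.get('html_url')}")
            requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WorkflowEventHandler).serve_forever()
```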
Advantages
- Real-time notifications
- Simple to set up and maintain
- No additional infrastructure required
Challenges
- Notification storm/fatigue during multiple failures across all repos
- No centralized data storage for historical analysis
- Requires updating each workflow file individually
Next Steps
- Discuss and decide on the preferred approach (pull-based, push-based, or a combination)
- Create a detailed implementation plan for the chosen approach(es)
- Assign team members to various tasks
- Set up a timeline for implementation and testing
- Plan for gradual rollout and monitoring of the new system
Questions to Consider
- What defines a "critical" workflow in our organization?
- How quickly do we need to be notified of issues?
- Do we need historical data for analysis, or are real-time alerts sufficient?
- Who should receive notifications, and how should they be prioritized?
- How will we handle false positives or transient failures?
Please comment with your thoughts, preferences, or any additional considerations for this monitoring system.
Describe alternatives you've considered
No response
Additional context
No response
Tagging @peterzhuamazon @gaiksaya @getsaurabh02 @prudhvigodithi @dblock for feedback and way forward.
Thanks @rishabh6788, this is an important enhancement. With the gathered data of GitHub Actions workflows we can even have a summary of force-merged pull requests, which is an important metric for the OpenSearch repo health. @getsaurabh02 @dblock
I would vote for the 1st option: collect the incremental PR workflows, index the data, and create a monitoring tool on top of the indexed raw data. Going with option 2, even if we created a custom GitHub Action for this purpose, it would be tough to update the hundreds of workflow files across all the repos, and ensuring that this action exists in every new repo is a tedious job. If we go with solution 1, running the workflow more aggressively to monitor just the incremental PR workflows would reduce the delay in detecting issues.
Thank you
I am also aligned with the pull-based monitoring and with carefully choosing the data sources we want to monitor. However, there will still be gaps where certain actions only run once a month during the release phase.
We need to figure out a consistent way to dry-run these actions in order to detect issues beforehand.
Thanks.
Going with option 1 we can do the following:
- Today, the metrics code collects the daily incremental PRs (updated, created, merged, closed) across all repositories.
- For the list of PRs that are retrieved, index the head commit. Example: https://api.github.com/repos/opensearch-project/dashboards-observability/pulls/2084
- Now, within the same scope or a separate process, use the check-runs API from GitHub to get the CI runs for the associated commit. Examples: https://api.github.com/repos/opensearch-project/query-insights/check-runs/29083082462 and https://api.github.com/repos/opensearch-project/query-insights/commits/1f4c4c635d6704e637004e9f363735461db21c2d/check-runs
- The check-runs API gives all the information about the CI runs for that commit (coming from a PR); index the relevant important information like `name`, `status`, `conclusion`, etc.
- Build the monitoring tool around the indexed data by running a query on the cluster to find the runs with `"conclusion": "failure"`; we can even target specific runs, for example `"name": "build-and-publish-snapshots"` with `conclusion` as failure (see the query sketch below).
- We can even use this information to derive a new metric (force-merged PRs and their trend) to find the PRs that are force-merged with failing CI checks.
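A sketch of such a query against a hypothetical gha-check-runs index containing these documents; the index name and the keyword field mappings are assumptions:

```python
# Example monitoring query sketch: find failing runs of a specific workflow
# over the last day. The index name "gha-check-runs" is a placeholder, and
# "conclusion"/"name" are assumed to be mapped as keyword fields.
import os
import requests

OPENSEARCH_URL = os.environ["OPENSEARCH_URL"]

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"conclusion": "failure"}},
                {"term": {"name": "build-and-publish-snapshots"}},
                {"range": {"completed_at": {"gte": "now-1d"}}},
            ]
        }
    },
    "size": 100,
}

resp = requests.get(f"{OPENSEARCH_URL}/gha-check-runs/_search", json=query, timeout=30)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    doc = hit["_source"]
    print(doc["repository"], doc["name"], doc["html_url"])
```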
@getsaurabh02 @dblock @rishabh6788 @peterzhuamazon @gaiksaya
Following is the sample schema that can be indexed to the metrics cluster.
```
{
  id: <The id of the workflow run; can be used directly as the document ID; given directly in the check-runs API response>
  repository: <The repo name>
  organization: <Optional: the repo org>
  number: <PR number for which the workflow was triggered>
  pull_commit: <The head commit of the PR for which the workflow was triggered; should be inferred from the pulls API>
  merged: <The current state of the PR, merged true/false; should be inferred from the pulls API>
  commit_id: <The commit ID of the PR for which the workflow was triggered; should be inferred from the pulls API>
  html_url: <The html_url of the workflow run; given directly in the check-runs API response>
  url: <The url of the workflow run; given directly in the check-runs API response>
  name: <The name of the workflow run; given directly in the check-runs API response>
  conclusion: <The result of the workflow run; given directly in the check-runs API response>
  started_at: <The started timestamp of the workflow run; given directly in the check-runs API response>
  completed_at: <The completed timestamp of the workflow run; given directly in the check-runs API response>
}
```
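A sketch of how one such document could be assembled from the pulls API and the per-commit check-runs API; the repo and PR number are taken from the example above, and error handling and pagination are omitted:

```python
# Sketch: build one document per check run for a given PR, combining the
# pulls API (PR metadata) with the commit check-runs API (CI results).
# This uses the PR head SHA for both pull_commit and commit_id.
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def check_run_docs(org: str, repo: str, pr_number: int) -> list[dict]:
    pr = requests.get(f"{GITHUB_API}/repos/{org}/{repo}/pulls/{pr_number}",
                      headers=HEADERS, timeout=30).json()
    head_sha = pr["head"]["sha"]
    runs = requests.get(
        f"{GITHUB_API}/repos/{org}/{repo}/commits/{head_sha}/check-runs",
        headers=HEADERS, timeout=30).json().get("check_runs", [])
    return [{
        "id": run["id"],
        "repository": repo,
        "organization": org,
        "number": pr_number,
        "pull_commit": head_sha,
        "merged": pr["merged"],
        "commit_id": head_sha,
        "html_url": run["html_url"],
        "url": run["url"],
        "name": run["name"],
        "conclusion": run["conclusion"],
        "started_at": run["started_at"],
        "completed_at": run["completed_at"],
    } for run in runs]

if __name__ == "__main__":
    for doc in check_run_docs("opensearch-project", "dashboards-observability", 2084):
        print(doc["name"], doc["conclusion"])
```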
Once we have the above information:
- We should be able to monitor the desired workflows.
- Create visualizations and trend graphs of repos with failing CI workflows and ability to filter per repo.
- Monitor and create visualizations of repos where PRs are merged without passing CI checks (an example aggregation sketch follows this list).
- Create issues directly with the PR and workflow run information and URLs.
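As an example of the force-merged view mentioned above, an aggregation over the same hypothetical gha-check-runs index could count merged PRs with failing checks per repo:

```python
# Sketch: aggregate force-merged PRs (merged with failing checks) per repo.
# Index name and keyword field mappings are assumptions, as above.
import os
import requests

OPENSEARCH_URL = os.environ["OPENSEARCH_URL"]

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"merged": True}},
                {"term": {"conclusion": "failure"}},
            ]
        }
    },
    "aggs": {
        "force_merged_by_repo": {"terms": {"field": "repository", "size": 50}}
    },
}

resp = requests.get(f"{OPENSEARCH_URL}/gha-check-runs/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["force_merged_by_repo"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```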
Thank you @rishabh6788 @getsaurabh02
Did some more deep diving into the possible repo workflows.
- To check all the possible action runs at the repo level (those defined in `.github/workflows`), use for example https://api.github.com/repos/opensearch-project/opensearch-build/actions/runs?per_page=100&created=2024-09-22..2024-09-23. This should give all the action workflows triggered by all possible events: https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows.
- However, the above API does not show the app-based runs, which are of type check-runs (runs like Mend and DCO). To see the status of and monitor these types of runs we should get the `head_commit` and use the API https://api.github.com/repos/opensearch-project/opensearch-build/commits/51b8b104ee98251aa8d38c24c2b9791a9206c5df/check-runs.
- Here is a small scenario for this repo where, for one event, the DCO action failed: https://github.com/opensearch-project/opensearch-build/runs/30403041967. The DCO failure is not recorded in `actions/runs` (https://api.github.com/repos/opensearch-project/opensearch-build/actions/runs?per_page=100&created=2024-08-22..2024-09-23&head_sha=51b8b104ee98251aa8d38c24c2b9791a9206c5df) since the DCO check is not part of `.github/workflows`; for this we should use https://api.github.com/repos/opensearch-project/opensearch-build/commits/51b8b104ee98251aa8d38c24c2b9791a9206c5df/check-runs.
- Coming from this comment https://github.com/opensearch-project/opensearch-build/issues/4941#issuecomment-2303625866: if we target monitoring only the workflows that are part of a PR, we will end up missing workflows in the repo that are not triggered by a PR (and the PR events). So we should use the workflow-runs API (https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28) and, at the same time, for app-based check-runs we should use the check-runs API based on the head commit. A combined sketch is included below.
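A minimal sketch of that combined approach, using the example repo and date window above; pagination and rate-limit handling are omitted:

```python
# Sketch: combine the workflow-runs API (.github/workflows runs) with the
# per-commit check-runs API (app-based checks like DCO and Mend) for coverage.
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def workflow_runs(org: str, repo: str, created: str) -> list[dict]:
    """Runs of workflows defined under .github/workflows."""
    resp = requests.get(f"{GITHUB_API}/repos/{org}/{repo}/actions/runs",
                        headers=HEADERS,
                        params={"per_page": 100, "created": created},
                        timeout=30)
    resp.raise_for_status()
    return resp.json().get("workflow_runs", [])

def check_runs_for_commit(org: str, repo: str, sha: str) -> list[dict]:
    """App-based check runs attached to a commit."""
    resp = requests.get(f"{GITHUB_API}/repos/{org}/{repo}/commits/{sha}/check-runs",
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json().get("check_runs", [])

if __name__ == "__main__":
    org, repo = "opensearch-project", "opensearch-build"
    for run in workflow_runs(org, repo, "2024-09-22..2024-09-23"):
        print("workflow", run["name"], run["conclusion"])
        for check in check_runs_for_commit(org, repo, run["head_sha"]):
            print("  check ", check["name"], check["conclusion"])
```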
Synced up with Prudhvi today and confirmed that the automation app is able to grab all the necessary context for the requirements.
We will see if we can combine the automation app and metrics cluster together on this.
Thanks.
Here are the final flow details, implemented based on all the merged pull requests linked to this issue.
```mermaid
graph LR
    A[GitHub Workflow Events] --> B[GitHub Automation App]
    B --> C[Failure Detection]
    C --> D[Workflow Failure Identified]
    D --> E[CloudWatch Alarms Update]
    D --> F[Failures Indexed]
    E --> I{Alarm Triggered?}
    I -- Yes --> G[Alerts Sent to Teams]
    I -- No --> J[No Action]
    F --> H[Data for Debugging and Trend Analysis]
```
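For reference, the "CloudWatch Alarms Update" step amounts to emitting a metric datapoint per detected failure so that a CloudWatch alarm on the metric can notify teams. A hedged sketch using boto3 follows; the namespace, metric, and dimension names are placeholders, and the automation app itself may implement this differently:

```python
# Sketch of the "CloudWatch Alarms Update" step: emit one datapoint per
# detected workflow failure; an alarm on this metric notifies the team.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def record_workflow_failure(repository: str, workflow: str) -> None:
    cloudwatch.put_metric_data(
        Namespace="GitHubActions/WorkflowFailures",
        MetricData=[{
            "MetricName": "WorkflowRunFailure",
            "Dimensions": [
                {"Name": "Repository", "Value": repository},
                {"Name": "Workflow", "Value": workflow},
            ],
            "Value": 1,
            "Unit": "Count",
        }],
    )

# Example: called by the failure-detection step for each failed run.
record_workflow_failure("opensearch-build", "publish-snapshots")
```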
Closing this issue. @rishabh6788 @getsaurabh02