opensearch-build
Monitor Critical GitHub Actions Workflows Across Organization Repositories
Is your feature request related to a problem? Please describe
Background
We recently had a situation where the publish-snapshots-to-Maven GitHub Actions workflow started failing across all the repositories due to an issue on the Sonatype Central side. They had accidentally deleted user tokens during maintenance, and our jobs started failing with 401 errors.
An operator happened to check the failed workflow on the commit they merged and saw the snapshot workflow failure; upon further investigation it was found that the same workflow had been failing across all the repositories with the same error for the past 24 hours.
We need to implement a system to monitor critical GitHub Actions workflows across multiple repositories in our organization. This will help us quickly identify and respond to workflow failures or issues.
Describe the solution you'd like
Proposed Solutions
We have identified two broad categories of approaches: pull-based and push-based monitoring.
1. Pull-based Monitoring
Description
a) Onboard GitHub Actions workflow metrics onto the existing metrics framework (Recommended)
- Use the existing metrics framework to onboard GitHub Actions metrics
- Add a monitor on the failure metric and notify in the Slack channel (already implemented)
b) Use GitHub REST APIs to periodically fetch the GitHub Actions status (a minimal sketch of this approach follows the Challenges below)
- Index the collected data in an OpenSearch cluster
- Implement a pull job that runs on a cron schedule in Jenkins
Advantages
- Centralized monitoring solution
- Can provide historical data and trends
- Allows for custom alerting based on various criteria
Challenges
- May have a slight delay in detecting issues due to the polling interval
- Need to manage API rate limits
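For illustration, here is a minimal sketch of option 1(b): a Jenkins cron job could poll the GitHub REST API and index each workflow run into OpenSearch. The repository list, the index name gha-workflow-runs, and the environment variables below are assumptions, not an existing implementation:

```python
# Minimal pull-job sketch: fetch recent workflow runs per repo and index them.
# GITHUB_TOKEN, OPENSEARCH_URL, and the index name "gha-workflow-runs" are
# placeholders; pagination, auth to OpenSearch, and retries are omitted.
import os
import requests

GITHUB_API = "https://api.github.com"
ORG = "opensearch-project"
REPOS = ["opensearch-build", "opensearch"]  # example subset of repos to watch

def fetch_workflow_runs(repo: str, token: str) -> list[dict]:
    """Fetch the most recent workflow runs for a repository."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{ORG}/{repo}/actions/runs",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        params={"per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("workflow_runs", [])

def index_run(opensearch_url: str, repo: str, run: dict) -> None:
    """Index one workflow run document, keyed by the run id."""
    doc = {
        "repository": repo,
        "name": run["name"],
        "status": run["status"],
        "conclusion": run["conclusion"],
        "created_at": run["created_at"],
        "html_url": run["html_url"],
    }
    requests.put(f"{opensearch_url}/gha-workflow-runs/_doc/{run['id']}",
                 json=doc, timeout=30).raise_for_status()

if __name__ == "__main__":
    token = os.environ["GITHUB_TOKEN"]
    opensearch_url = os.environ["OPENSEARCH_URL"]
    for repo in REPOS:
        for run in fetch_workflow_runs(repo, token):
            index_run(opensearch_url, repo, run)
```

Running such a job on a short cron interval (for example every 15-30 minutes) keeps the detection delay bounded while keeping the API call volume manageable against rate limits.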
2. Push-based Monitoring
Description
a) Slack Notifications Integration in Workflows
- Add a Slack action to critical workflows
- Configure the action to send a Slack message notification when a job fails
b) Email Notifications
- Use GitHub's built-in email notification system or a custom email action
- Send detailed email reports for workflow failures
c) Webhook Integration
- Set up a custom webhook endpoint in our infrastructure
- Configure GitHub to send workflow status updates to this endpoint
- Process incoming webhooks to trigger appropriate actions (e.g., update a status page, send notifications)
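For illustration, a minimal sketch of the webhook option, assuming an organization webhook is configured to deliver workflow_run events to this endpoint and SLACK_WEBHOOK_URL points at a Slack incoming webhook (both are placeholders):

```python
# Minimal webhook receiver sketch for GitHub "workflow_run" events.
# The endpoint, port, and SLACK_WEBHOOK_URL are illustrative. A production
# endpoint should also verify the X-Hub-Signature-256 header.
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

class WorkflowEventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = json.loads(body or "{}")
        run = event.get("workflow_run", {})
        # Only act on completed runs that did not succeed.
        if event.get("action") == "completed" and run.get("conclusion") != "success":
            message = (f"Workflow '{run.get('name')}' failed in "
                       f"{event.get('repository', {}).get('full_name')}: "
                       f"{run.get('html_url')}")
            requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WorkflowEventHandler).serve_forever()
```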
Advantages
- Real-time notifications
- Simple to set up and maintain
- No additional infrastructure required
Challenges
- Notification storm/fatigue during multiple failures across all repos
- No centralized data storage for historical analysis
- Requires updating each workflow file individually
Next Steps
- Discuss and decide on the preferred approach (pull-based, push-based, or a combination)
- Create a detailed implementation plan for the chosen approach(es)
- Assign team members to various tasks
- Set up a timeline for implementation and testing
- Plan for gradual rollout and monitoring of the new system
Questions to Consider
- What defines a "critical" workflow in our organization?
- How quickly do we need to be notified of issues?
- Do we need historical data for analysis, or are real-time alerts sufficient?
- Who should receive notifications, and how should they be prioritized?
- How will we handle false positives or transient failures?
Please comment with your thoughts, preferences, or any additional considerations for this monitoring system.
Describe alternatives you've considered
No response
Additional context
No response
Tagging @peterzhuamazon @gaiksaya @getsaurabh02 @prudhvigodithi @dblock for feedback and way forward.
Thanks @rishabh6788, this is an important enhancement. With the gathered data of GitHub Actions workflows we can even have a summary of force-merged pull requests, which is an important metric for the OpenSearch repo health. @getsaurabh02 @dblock
I would vote for the 1st option: collect the incremental PR workflows, index the data, and create a monitoring tool on top of the indexed raw data. Going with option 2, even if we created a custom GitHub Action for this purpose, it would be tough to update the hundreds of workflow files across all the repos, and ensuring that this action exists in every new repo is a tedious job. If we go with solution 1, running the workflow more aggressively to monitor just the incremental PR workflows would reduce the delay in detecting issues.
Thank you
I am also aligned with the pull-based monitoring and with carefully choosing the data sources we want to monitor. However, there will still be gaps where certain actions only run once a month during the release phase.
We need to figure out a consistent way to dry-run these actions in order to detect issues beforehand.
Thanks.
Going with option 1 we can do the following:
- Today, the metrics code collects the daily incremental PRs (updated, created, merged, closed) across all repositories.
- For the list of PRs that are retrieved, index the head commit. Example: https://api.github.com/repos/opensearch-project/dashboards-observability/pulls/2084
- Now, within the same scope or a separate process, use the check-runs API from GitHub to get the CI runs for the associated commit. Examples: https://api.github.com/repos/opensearch-project/query-insights/check-runs/29083082462 and https://api.github.com/repos/opensearch-project/query-insights/commits/1f4c4c635d6704e637004e9f363735461db21c2d/check-runs
- The check-runs API gives all the information about the CI runs for that commit (coming from a PR); index the relevant important information like `name`, `status`, `conclusion`, etc.
- Build the monitoring tool around the indexed data by running a query on the cluster to find the runs with `"conclusion": "failure"`; we can even target specific runs, for example `"name": "build-and-publish-snapshots"` with `conclusion` as failure (see the query sketch below).
- We can even use this information to derive a new metric (force-merged PRs and their trend) to find the PRs that are force-merged with failing CI checks.
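A sketch of such a query against a hypothetical gha-check-runs index containing these documents; the index name and the keyword field mappings are assumptions:

```python
# Example monitoring query sketch: find failing runs of a specific workflow
# over the last day. The index name "gha-check-runs" is a placeholder, and
# "conclusion"/"name" are assumed to be mapped as keyword fields.
import os
import requests

OPENSEARCH_URL = os.environ["OPENSEARCH_URL"]

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"conclusion": "failure"}},
                {"term": {"name": "build-and-publish-snapshots"}},
                {"range": {"completed_at": {"gte": "now-1d"}}},
            ]
        }
    },
    "size": 100,
}

resp = requests.get(f"{OPENSEARCH_URL}/gha-check-runs/_search", json=query, timeout=30)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    doc = hit["_source"]
    print(doc["repository"], doc["name"], doc["html_url"])
```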
@getsaurabh02 @dblock @rishabh6788 @peterzhuamazon @gaiksaya
Following is the sample schema that can be indexed to the metrics cluster.
```
{
  id: <The id of the workflow run; can be used directly as the document ID; given directly in the check-runs API response>
  repository: <The repo name>
  organization: <Optional: the repo org>
  number: <PR number for which the workflow was triggered>
  pull_commit: <The head commit of the PR for which the workflow was triggered; should be inferred from the pulls API>
  merged: <The current state of the PR, merged true/false; should be inferred from the pulls API>
  commit_id: <The commit ID of the PR for which the workflow was triggered; should be inferred from the pulls API>
  html_url: <The html_url of the workflow run; given directly in the check-runs API response>
  url: <The url of the workflow run; given directly in the check-runs API response>
  name: <The name of the workflow run; given directly in the check-runs API response>
  conclusion: <The result of the workflow run; given directly in the check-runs API response>
  started_at: <The started timestamp of the workflow run; given directly in the check-runs API response>
  completed_at: <The completed timestamp of the workflow run; given directly in the check-runs API response>
}
```
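A sketch of how one such document could be assembled from the pulls API and the per-commit check-runs API; the repo and PR number are taken from the example above, and error handling and pagination are omitted:

```python
# Sketch: build one document per check run for a given PR, combining the
# pulls API (PR metadata) with the commit check-runs API (CI results).
# This uses the PR head SHA for both pull_commit and commit_id.
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def check_run_docs(org: str, repo: str, pr_number: int) -> list[dict]:
    pr = requests.get(f"{GITHUB_API}/repos/{org}/{repo}/pulls/{pr_number}",
                      headers=HEADERS, timeout=30).json()
    head_sha = pr["head"]["sha"]
    runs = requests.get(
        f"{GITHUB_API}/repos/{org}/{repo}/commits/{head_sha}/check-runs",
        headers=HEADERS, timeout=30).json().get("check_runs", [])
    return [{
        "id": run["id"],
        "repository": repo,
        "organization": org,
        "number": pr_number,
        "pull_commit": head_sha,
        "merged": pr["merged"],
        "commit_id": head_sha,
        "html_url": run["html_url"],
        "url": run["url"],
        "name": run["name"],
        "conclusion": run["conclusion"],
        "started_at": run["started_at"],
        "completed_at": run["completed_at"],
    } for run in runs]

if __name__ == "__main__":
    for doc in check_run_docs("opensearch-project", "dashboards-observability", 2084):
        print(doc["name"], doc["conclusion"])
```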
Once we have the above information:
- We should be able to monitor the desired workflows.
- Create visualizations and trend graphs of repos with failing CI workflows and ability to filter per repo.
- Monitor and create visualizations of repos where PRs are merged without passing CI checks (an example aggregation sketch follows this list).
- Create issues directly with the PR and workflow run information and URLs.
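As an example of the force-merged view mentioned above, an aggregation over the same hypothetical gha-check-runs index could count merged PRs with failing checks per repo:

```python
# Sketch: aggregate force-merged PRs (merged with failing checks) per repo.
# Index name and keyword field mappings are assumptions, as above.
import os
import requests

OPENSEARCH_URL = os.environ["OPENSEARCH_URL"]

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"merged": True}},
                {"term": {"conclusion": "failure"}},
            ]
        }
    },
    "aggs": {
        "force_merged_by_repo": {"terms": {"field": "repository", "size": 50}}
    },
}

resp = requests.get(f"{OPENSEARCH_URL}/gha-check-runs/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["force_merged_by_repo"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```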
Thank you @rishabh6788 @getsaurabh02
Did some more deep diving into the possible repo workflows.
- To check all the possible action runs at the repo level (those defined in `.github/workflows`), use for example https://api.github.com/repos/opensearch-project/opensearch-build/actions/runs?per_page=100&created=2024-09-22..2024-09-23. This should give all the action workflows triggered by all possible events: https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows.
- However, the above API does not show the app-based runs, which are of type check-runs (runs like Mend and DCO). To see the status of and monitor these types of runs we should get the `head_commit` and use the API https://api.github.com/repos/opensearch-project/opensearch-build/commits/51b8b104ee98251aa8d38c24c2b9791a9206c5df/check-runs.
- Here is a small scenario for this repo where, for one event, the DCO action failed: https://github.com/opensearch-project/opensearch-build/runs/30403041967. The DCO failure is not recorded in `actions/runs` (https://api.github.com/repos/opensearch-project/opensearch-build/actions/runs?per_page=100&created=2024-08-22..2024-09-23&head_sha=51b8b104ee98251aa8d38c24c2b9791a9206c5df) since the DCO check is not part of `.github/workflows`; for this we should use https://api.github.com/repos/opensearch-project/opensearch-build/commits/51b8b104ee98251aa8d38c24c2b9791a9206c5df/check-runs.
- Coming from this comment https://github.com/opensearch-project/opensearch-build/issues/4941#issuecomment-2303625866: if we target monitoring only the workflows that are part of a PR, we will end up missing workflows in the repo that are not triggered by a PR (and the PR events). So we should use the workflow-runs API (https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28) and, at the same time, for app-based check-runs we should use the check-runs API based on the head commit. A combined sketch is included below.
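A minimal sketch of that combined approach, using the example repo and date window above; pagination and rate-limit handling are omitted:

```python
# Sketch: combine the workflow-runs API (.github/workflows runs) with the
# per-commit check-runs API (app-based checks like DCO and Mend) for coverage.
import os
import requests

GITHUB_API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def workflow_runs(org: str, repo: str, created: str) -> list[dict]:
    """Runs of workflows defined under .github/workflows."""
    resp = requests.get(f"{GITHUB_API}/repos/{org}/{repo}/actions/runs",
                        headers=HEADERS,
                        params={"per_page": 100, "created": created},
                        timeout=30)
    resp.raise_for_status()
    return resp.json().get("workflow_runs", [])

def check_runs_for_commit(org: str, repo: str, sha: str) -> list[dict]:
    """App-based check runs attached to a commit."""
    resp = requests.get(f"{GITHUB_API}/repos/{org}/{repo}/commits/{sha}/check-runs",
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json().get("check_runs", [])

if __name__ == "__main__":
    org, repo = "opensearch-project", "opensearch-build"
    for run in workflow_runs(org, repo, "2024-09-22..2024-09-23"):
        print("workflow", run["name"], run["conclusion"])
        for check in check_runs_for_commit(org, repo, run["head_sha"]):
            print("  check ", check["name"], check["conclusion"])
```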
Synced up with Prudhvi today and confirmed that the automation app is able to grab all the necessary context for the requirements.
We will see if we can combine the automation app and metrics cluster together on this.
Thanks.
Here are the final flow details, implemented based on all the merged pull requests linked to this issue.
```mermaid
graph LR
    A[GitHub Workflow Events] --> B[GitHub Automation App]
    B --> C[Failure Detection]
    C --> D[Workflow Failure Identified]
    D --> E[CloudWatch Alarms Update]
    D --> F[Failures Indexed]
    E --> I{Alarm Triggered?}
    I -- Yes --> G[Alerts Sent to Teams]
    I -- No --> J[No Action]
    F --> H[Data for Debugging and Trend Analysis]
```
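For reference, the "CloudWatch Alarms Update" step amounts to emitting a metric datapoint per detected failure so that a CloudWatch alarm on the metric can notify teams. A hedged sketch using boto3 follows; the namespace, metric, and dimension names are placeholders, and the automation app itself may implement this differently:

```python
# Sketch of the "CloudWatch Alarms Update" step: emit one datapoint per
# detected workflow failure; an alarm on this metric notifies the team.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def record_workflow_failure(repository: str, workflow: str) -> None:
    cloudwatch.put_metric_data(
        Namespace="GitHubActions/WorkflowFailures",
        MetricData=[{
            "MetricName": "WorkflowRunFailure",
            "Dimensions": [
                {"Name": "Repository", "Value": repository},
                {"Name": "Workflow", "Value": workflow},
            ],
            "Value": 1,
            "Unit": "Count",
        }],
    )

# Example: called by the failure-detection step for each failed run.
record_workflow_failure("opensearch-build", "publish-snapshots")
```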
Closing this issue. @rishabh6788 @getsaurabh02