actions-runner-controller icon indicating copy to clipboard operation
actions-runner-controller copied to clipboard

Add workflow job metrics to Github webhook server

Open ColinHeathman opened this issue 3 years ago • 1 comments

Relates to issue https://github.com/actions-runner-controller/actions-runner-controller/issues/1551

This PR adds these metrics:

github_workflow_job_queue
github_workflow_job_runtime
github_workflow_job_conclusion
github_workflow_job_status

The system does store webhook events temporarily, but only uses the events provided timestamps for measuring duration. This means the best resolution for these durations is 1 second.

ColinHeathman avatar Sep 19 '22 22:09 ColinHeathman

I have tested this locally with a basic installation. I am in the process of testing this change in an environment with some live traffic, so the PR will remain a draft until that's completed.

Nitpicks are welcome in the meantime

ColinHeathman avatar Sep 19 '22 22:09 ColinHeathman

@mumoshu Please take a look at this new implementation.

This version is stateless. I was able to find the information I need for the metrics from the logs for the jobs. This uses the GH client to download those logs and parse them, so there is a bit of a downside there, but I think having this be stateless is much better.

ColinHeathman avatar Oct 04 '22 00:10 ColinHeathman

@ColinHeathman Hey! Thanks a lot for your effort. Sorry to say, but this approach isn't going to scale and that's actually why we moved from the pull-based autoscaling to webhook-based autoscaling. GitHub API rate limit can be quite insufficient when you're going to run hundreds of jobs per hour. AFAIK it's 1000/hour with PAT, and 15000/hour with GitHub App organizational installation. You added an API call per every workflow_job event in the last commit. Assuming we might receive 3 workflow job events for queued, in_progress, and completed, this approach might make ARC with PAT not scale more than 333 jobs/hour, which is quite low. We do call some APIs in other places for e.g. runner graceful shutdown. So the actual limit would be lower.

mumoshu avatar Oct 04 '22 00:10 mumoshu

@ColinHeathman IMHO, making it stateful alone is fine, as long as the state is allowed to get lost and local. That said, I thought we'd be happy with your alternative mentioned in https://github.com/actions-runner-controller/actions-runner-controller/pull/1814#discussion_r975891494, which is to make those counters so that you can still count events per status, while you can also aggregate counts at query time using e.g. Prometheus, which makes it possible to approximately count events per status without introducing a global state to ARC. WDYT?

mumoshu avatar Oct 04 '22 00:10 mumoshu

That said, your awesome fetchAndParseWorkflowJobLogs and the related code might better fit within another controller/command within ARC(name it e.g. actions-metrics-server) or another application dedicated to monitoring workflow jobs from the information provided by GitHub Actions API, not from ARC.

In the dedicated controller/command/applilcation, you might be able to poll those APIs less frequently than 1 per workflow job event, which prevents rate limit issues while providing the metrics you want.

Note though, providing workflow job event status count metrics per https://github.com/actions-runner-controller/actions-runner-controller/pull/1814#issuecomment-1266235039 is orthogonal to the metrics derived from GitHub Actions. The former is for monitoring ARC and the latter is for monitoring GitHub Actions. Both look useful to me. Just that I think we shouldn't mix the two.

mumoshu avatar Oct 04 '22 00:10 mumoshu

@ColinHeathman IMHO, making it stateful alone is fine, as long as the state is allowed to get lost. That said, I thought we'd be happy with your alternative mentioned in https://github.com/actions-runner-controller/actions-runner-controller/pull/1814#discussion_r975891494, which is to make those counters so that you can still count events per status, while you can also aggregate counts at query time using e.g. Prometheus, which makes it possible to approximately count events per status without introducing a global state to ARC. WDYT?

@mumoshu I did make the change to counters here (the gauges are gone) that's well and good.

Runtimes can be measured off of the action=completed event, but it's less accurate than the logs. I think losing that accuracy is fine though.

I currently don't see a way to get Queue times without either comparing events or making an API call for the logs. I think maybe that needs to be a separate issue or PR since it's going to be more difficult.

ColinHeathman avatar Oct 04 '22 00:10 ColinHeathman

What do you think of the metric names? I want to be sure that they are correct

ColinHeathman avatar Oct 04 '22 00:10 ColinHeathman

@ColinHeathman Thanks for your confirmation. The metrics names look great to me!

mumoshu avatar Oct 04 '22 01:10 mumoshu

Runtimes can be measured off of the action=completed event, but it's less accurate than the logs. I think losing that accuracy is fine though.

I'm fine with that too. We'd better add some comment in code to let us keep remembering that the run_duration metric count is not ging to be the same as workflow_job_completions_total when you have two or more ARC webhook-server pods.

I currently don't see a way to get Queue times without either comparing events or making an API call for the logs. I think maybe that needs to be a separate issue or PR since it's going to be more difficult.

I agree with you! If it's going to be something like what I've explained in https://github.com/actions-runner-controller/actions-runner-controller/pull/1814#issuecomment-1266264170, that looks good to me.

mumoshu avatar Oct 04 '22 01:10 mumoshu

@ColinHeathman Hey! Is this ready for review? If it's working for your cluster, that's all right- I'll try resolving merge conflicts, test it on my cluster, and add necessary changes if you're busy.

mumoshu avatar Nov 07 '22 10:11 mumoshu

Hey @mumoshu, this does work in my cluster. We at BenchSci are currently using this as it is, since we have GH Enterprise and don't need to worry so much about the API rate limit.

I was planning on moving this to the dedicated controller/command/applilcation as you suggested in https://github.com/actions-runner-controller/actions-runner-controller/pull/1814#issuecomment-1266264170, but I haven't had the time to do the rework. I don't think I'll be able to get to it until next month.

Feel free to make changes :) I can try to help anyway I can

ColinHeathman avatar Nov 10 '22 02:11 ColinHeathman

@ColinHeathman Hey! I finally had some time to work on it, and #2057's what I intended. Would you mind taking a look when you have some time?

mumoshu avatar Nov 30 '22 15:11 mumoshu

@ColinHeathman Hey! I'm closing this in favor of #2057. Thank you again for doing 99% of the work in this pull request! I've featured you in our release note to make your awesome work more visible ❤️ https://github.com/actions-runner-controller/actions-runner-controller/pull/2068

mumoshu avatar Dec 07 '22 10:12 mumoshu

#2057 will be included in the next release of ARC, 0.27.0. Please feel free to give it a shot and report back any issues and ideas if any 🙏

mumoshu avatar Dec 07 '22 10:12 mumoshu