hydra icon indicating copy to clipboard operation
hydra copied to clipboard

redo the notify events between eval and queue-runner

Open disassembler opened this issue 6 years ago • 5 comments

(cherry picked from commit 73ca325d1c0f7914640a63764c9a6d448fde5bd0)

We did this at IOHK to improve some performance problems with notifications. It's still not perfect, and there may be a good reason to not do it this way, but creating a PR for the community to discuss whether this is worthwhile, and if so if it needs improvements to get in.

disassembler avatar Mar 20 '19 19:03 disassembler

from clever on IRC:

when hydra does a new eval, the old version reports the lowest build# it just added in the eval. so, if a given eval creates builds 6, 7, and 8, it will report 6 over the psql event channel the queue-runner will then query psql, for every build >=6, that is unfinished, and begin to process jobs. the problem, is that build 3, was cached from a previous eval, and wont be processed. and so, the github status plugin, never gets told about build3 finishing on the new eval and github is never told that the job is still passing on the new git rev. my change, makes it report the eval# to queue-runner, which then refreshes every build in the eval

grahamc avatar Mar 21 '19 18:03 grahamc

If I understand correctly what this does, it could cause a lot of extra load on hydra.nixos.org, since for every Nixpkgs/NixOS evaluation, it will have to process every finished build again and run plugins. That's tens of thousands of builds per evaluation...

edolstra avatar Mar 21 '19 21:03 edolstra

yeah, it may be best to not upstream this until the hydra-notify perf issues are fixed, possibly as a result of redoing the whole notify system

cleverca22 avatar Mar 25 '19 17:03 cleverca22

FYI, I've rewritten this to work with recent Hydra versions here:

https://github.com/hackworthltd/hydra/commit/f9bf7f6f4982ef35fd64d0c6f88a96a182de5dff

We've been using it in production for awhile and it seems to work reasonably well, with one caveat that I'll describe below.

I dunno why IOHK originally wrote this patch (something to do with performance?), but the reason we use it is as follows:

  • Say you have a Hydra jobset with 20 jobs. Your Hydra is set up to notify GitHub of the status of this jobset.
  • You open a PR, which kicks off a Hydra build. Say all 20 jobs complete, and GitHub shows the status as "20 checks passed."
  • Now you make a small change to the PR and push. Say this change only causes 5 of the jobs to re-run, and the rest are cached.
  • GitHub will now only show "5 checks passed," and that's all it ever will report.

As the user, there is no way to know whether the remaining 15 jobs are still running, the reporting failed, or what, unless you go check Hydra. This is not very nice UX, as normally you only need to check Hydra when a job fails and you want to see the logs.

So, this patch fixes the problem by ensuring that all 20 job statuses are reported in this scenario, and that's why we use it.

There is one problem, however. This patch (or at least my rewritten version — I'm not sure whether this issue exists in @cleverca22's original patch) will report all cached statuses for the given job to GitHub. So, for example, if a particular job has been cached for the last 4 commits to main, and another build is triggered where that job is once again cached, Hydra will report 5 status updates to GitHub for the one job name (with different job IDs). This is not a problem for the GitHub status integration, as GitHub will happily ignore the redundant status updates.

However, if you want to do something like kick off a GitHub Action workflow whenever a job named foo completes successfully (e.g., with a Hydra GitHub Action plugin), then if you're also running this patch, and foo has been cached the past 5 builds, then Hydra will trigger the workflow 5 times! :)

So while I'm not requesting that this particular patch should be merged, it would be nice if one could configure a Hydra to report cached status updates to GitHub when a jobset is built (and just once, rather than n times when the job has been cached n times).

(There is currently no GitHub Action plugin for Hydra, but we have a very hacky one here, which is how I discovered this wrinkle in the eval patch: https://github.com/hackworthltd/hydra/commit/a5ce9dce8d40ded1b535e31f3db418e7795aa226)

dhess avatar Aug 23 '21 17:08 dhess

Two more things:

  1. If there are performance concerns for the NixOS Hydra, this could be made an optional feature.
  2. I'm happy to open a new issue for the cached status feature, if anyone thinks it should be in its own issue.

dhess avatar Aug 23 '21 17:08 dhess