homu icon indicating copy to clipboard operation
homu copied to clipboard

Sometimes approves the wrong commit

Open ehuss opened this issue 2 years ago • 9 comments

There have been two circumstances where a normal @bors r+ approved an old commit:

  • https://github.com/rust-lang/rust/pull/95251#issuecomment-1079483432
  • https://github.com/rust-lang/rust/pull/98708#issuecomment-1171587241

My hunch is that somehow bors missed the push notification, and the state.head_sha doesn't get updated to the latest version.

I think Homu should never approve a commit that is not the latest commit. I imagine it should check what the latest commit is when approving instead of assuming that the database is in sync. There's definitely risks about race conditions here that may not be solvable, but some extra effort might make it more resilient.

What's scary is that this may be happening and nobody notices. These two instances only got noticed because the old commit failed in CI.

ehuss avatar Jul 01 '22 04:07 ehuss

Related Zulip: https://rust-lang.zulipchat.com/#narrow/stream/242791-t-infra/topic/Bors.20races.20.2F.20out.20of.20sync/near/288119101

So homu is the result of GitHub REST API rate limitation if I understood it correctly where as bors-ng is stateless and homu stateful relying entirely on the ingress webhooks for state synchronisation - which seems to have failed few or more times.

I am unsure if there is some type of retry mechanism in the ingress GH webhook calls and whether homu honors retries or whether this is something that could be configured on GH side or make Homu support - which could provide easy low hanging fruit fix to increase stability - webhooks are supposed to get status back until they are delivered EDIT: Yes there is new API for exactly that ^^^ :partying_face:

1 - Figure out the impact first

I am going to see if I can do a GraphQL query/ies to scrape off all the PR r+ comments/commits and correlate to figure out how many times - and exactly where - homu has picked up the wrong commit vs what the latest commit was given PR

That analysis would / will allow us to see which and how many PRs are affected and then see how much of an issue this is.

2 - Potentially fix something

Then to potentially fix this in Homu it would mean either - depending on reliability / real impact factors as above

  1. Using GraphQL queries to ensure the state is synchronsised that doesn't hurt rate limits - if impact was high
  • or alternatively low hanging fruit if impact was low -
  1. Figure out why webhook calls are being lost and maybe put some more reliability tlc around those

pinkforest avatar Jul 01 '22 10:07 pinkforest

Re: GitHub delivery tracking API

Seems GitHub has introduced API exactly for what we might need re 2.2) ..

https://github.blog/changelog/2021-06-30-webhook-deliveries-api/

https://docs.github.com/en/rest/reference/repos#webhooks

Can someone go see what the situation looks for 30 last days?

Also it may as well be that simple "healthcheck" polling monitor that can trigger the retries via this new API in homu might greatly improve webhook deliverability if required ?

And if the "healthcheck" monitor simply reports to Homu that there are undelivered webhooks it would hang on before acting on things..

pinkforest avatar Jul 01 '22 10:07 pinkforest

I think rather than focusing on approval time, we should probably have a check at (bors) merge commit generation time that the approved source commit is equivalent to the PR head commit at that time. That also helps us avoid issues with further pushes being missed.

I don't think focusing on missed webhooks is going to work well; GitHub has just not delivered things in the past for some time, we shouldn't depend on that for reliability here.

Mark-Simulacrum avatar Jul 01 '22 12:07 Mark-Simulacrum

That would require querying GH one per commit generation for those but I guess there isn't too many of those that would break the rate limit as was the reason for Homu to exist Also that would be just to provide a failure and feedback mechanism but not really fixing the sync issue that happens at approval time Are we happy with that? Or do we mean that homu will pick nearest-before-approval commit regardless what was supposed approved commit (out of sync) without providing feedback about some internal sync failure e.g. just deal with it?

pinkforest avatar Jul 01 '22 12:07 pinkforest

We would post an error message in that case. There's no real issue with rate limits -- we're only generating new merge commits probably roughly <15 times/day, so checking at that time is very cheap.

I think there's more work that can be done to improve homu's synchronization with real state, but that's a separate issue from this one and doesn't need to be coupled.

Mark-Simulacrum avatar Jul 01 '22 13:07 Mark-Simulacrum

I'll also try to finish the impact analysis during the weekend off curiosity -

just needs some regular expression on bors messages and comparing them to commit history

I just got some quick GraphQL running on GH API for doing analysis what has been going on: https://github.com/pinkforest/rust-gh-homu-off-syncs

---- PR(true): Add a `--build-dir` flag to rustbuild
 79f8dc0 - commit
bors - :pushpin: Commit 79f8dc0b898b0a387df684a539cd97446a0f964f has been approved by `jyn514`

<!-- @bors r=jyn514 79f8dc0b898b0a387df684a539cd97446a0f964f -->
<!-- homu: {"type":"Approved","sha":"79f8dc0b898b0a387df684a539cd97446a0f964f","approver":"jyn514"} -->
---- PR(true): Add macro_rules! rustdoc change to 1.62 relnotes
 4ea18cc - commit
bors - :v: @ can now approve this pull request
<!-- homu: {"type":"Delegated","delegator":"jyn514","delegate":""} -->
bors - :v: @CAD97 can now approve this pull request
<!-- homu: {"type":"Delegated","delegator":"jyn514","delegate":"CAD97"} -->
bors - :pushpin: Commit ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5 has been approved by `Mark-Simulacrum`

<!-- @bors r=Mark-Simulacrum ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5 -->
<!-- homu: {"type":"Approved","sha":"ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5","approver":"Mark-Simulacrum"} -->
bors - :pushpin: Commit 4ea18ccf7e4e4afe213c2b3a74c558135c423fde has been approved by `jyn514`

<!-- @bors r=jyn514 4ea18ccf7e4e4afe213c2b3a74c558135c423fde -->
<!-- homu: {"type":"Approved","sha":"4ea18ccf7e4e4afe213c2b3a74c558135c423fde","approver":"jyn514"} -->
---- PR(true): Rollup of 9 pull requests
 1d845bd - commit
 c6f362a - commit
 ddb6313 - commit
 debee1e - commit
 8931fbd - commit
 e043821 - commit
 c29e584 - commit
 cd7bd8b - commit
 3fcf84a - commit
 d791310 - commit
 79f8dc0 - commit
 4ea18cc - commit
 0d5636c - commit
 41e7991 - commit
 0420231 - commit
 e59693a - commit
 734f21c - commit
 80dd48b - commit
 c4acd06 - commit
 335e7d3 - commit
 18d4228 - commit
bors - :pushpin: Commit 18d4228456a98fd6d8950f74fd117aba7fb45757 has been approved by `matthiaskrgr`

<!-- @bors r=matthiaskrgr 18d4228456a98fd6d8950f74fd117aba7fb45757 -->
<!-- homu: {"type":"Approved","sha":"18d4228456a98fd6d8950f74fd117aba7fb45757","approver":"matthiaskrgr"} -->
bors - :hourglass: Testing commit 18d4228456a98fd6d8950f74fd117aba7fb45757 with merge 7e2733bb1dd9afe5fd20370ca4d539d42ac50419...
<!-- homu: {"type":"BuildStarted","head_sha":"18d4228456a98fd6d8950f74fd117aba7fb45757","merge_sha":"7e2733bb1dd9afe5fd20370ca4d539d42ac50419"} -->
bors - :sunny: Test successful - [checks-actions](https://github.com/rust-lang-ci/rust/runs/7148849106?check_suite_focus=true)
Approved by: matthiaskrgr
Pushing 7e2733bb1dd9afe5fd20370ca4d539d42ac50419 to master...
<!-- homu: {"type":"BuildCompleted","approved_by":"matthiaskrgr","base_ref":"master","builders":{"checks-actions":"https://github.com/rust-lang-ci/rust/runs/7148849106?check_suite_focus=true"},"merge_sha":"7e2733bb1dd9afe5fd20370ca4d539d42ac50419"} -->

After that I can push a PR for Homu to do that merge commit commit sync check

pinkforest avatar Jul 01 '22 13:07 pinkforest

if I understood it correctly where as bors-ng is stateless and homu stateful

Unfortunately, no. There are three borses, not two.

  • bors — https://github.com/graydon/bors — is stateless
  • homu — https://github.com/rust-lang/homu — is stateful. It exists because of GitHub's rate limits.
  • bors-ng — https://github.com/bors-ng/bors-ng — is also stateful. It exists because self-hosting homu is too complicated. Since literally anybody can sign up their repository at any time, bors-ng also has to worry about rate limits.

notriddle avatar Jul 02 '22 00:07 notriddle

Raised #178 for suspected general WebHook delivery instability which addressing commit mixups alone would not fix (this ticket)

pinkforest avatar Jul 17 '22 11:07 pinkforest

For reference, here is another case where this (or something similar) happened https://github.com/rust-lang/rust/pull/119748#issuecomment-1971772018

tgross35 avatar Mar 01 '24 07:03 tgross35