skara
skara copied to clipboard

1606: PullRequestPoller always processes the last updated MR for GitLab

Open erikj79 opened this issue 2 years ago • 4 comments

The new PullRequestPoller from SKARA-1565 is inefficient in combination with GitLab. In the most common case, when no MRs have been updated, you would expect it to only do one GET query and be done. Instead, that initial query will always return the last updated MR, and then continue to fetch all the metadata for it, which adds up to a lot of time.

This is made even worse by having to fetch comments and reviews for every MR to be able to compare them for potential updates. In GitLab, both comments and reviews are stored in the same set of "notes" for a merge request. Since we are only fetching them for comparing between different snapshots, we can make this faster by just fetching all notes once (which will also increase accuracy as there are other kinds of notes too, which we probably should check for updates too).

The repeated refetching of the last updated MR is basically the same problem as the IssuePoller is already solving. The resolution for timestamp based queries is limited (1s on GitLab) and the query is inclusive. In contrast, the current implementation for GitLabRepository treats the timestamp as exclusive.

This patch does the following:

Change all timestamp based pull request and issue queries (including for test implementations) to be inclusive, for consistency.
Implement the same kind of "positive" padding to the updatedAfter parameter in PullRequestPoller as is already present in IssuePoller. I'm also modifying tests to actually verify this behavior by exposing some of the internal data of the poller to the test.
Add PullRequest::comparisonSnapshot which returns an Object that represents all data that needs to be considered when evaluating if a snapshot is equal to another. In GitLabMergeRequest, this is cached lazily to avoid repeated remote calls.

While working on this, I've realized that IssuePoller and PullRequestPoller are basically doing the exact same thing now, with pretty small variations. Ideally we should combine these into a common super class, but I'm leaving that for a separate change to make the current changes easier to follow.

I'm going to run this change in "staging" for a bit to see how it behaves.

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace

Issue

SKARA-1606: PullRequestPoller always processes the last updated MR for GitLab

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/skara pull/1380/head:pull/1380
$ git checkout pull/1380

Update a local copy of the PR:
$ git checkout pull/1380
$ git pull https://git.openjdk.org/skara pull/1380/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 1380

View PR using the GUI difftool:
$ git pr show -t 1380

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/skara/pull/1380.diff

Sep 22 '22 21:09 erikj79

:wave: Welcome back erikj! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Sep 22 '22 21:09 bridgekeeper[bot]

Webrevs

Sep 22 '22 21:09 mlbridge[bot]

Looking at logs, this is now working as intended.

Sep 23 '22 14:09 erikj79

This change is currently blocking deployment of new changes to Skara bots, unless we back out SKARA-1590. This is because the PullRequestPoller isn't behaving well enough to actually be used yet. It would certainly be better to get this fix in however, so we can move forward with more efficient polling.

Oct 04 '22 16:10 erikj79

Hey @erikj79, nice work 🎉

First of all, I think your changes are fine, the reasoning and the code checks out. I have a bunch of suggestions, but those you can choose whether you want to pursue or not, they do not have to be made within the scope of this PR. Smaller things to consider:

in GitLabMergeRequest, since you are already caching the comparisonSnapshot, should we instead cache the notes JSON result? There are several methods in GitLabMergeRequest that makes use of the notes JSON, then they would be able to utilise this cached result as well (saving a number of requests). This thought can of course be extended to other JSON objects that GitLabMergeRequest fetches as well.
very minor nit, but I would have named comparisonSnapshot just snapshot, but that is up to you 😄
more of a question than a comment: the snapshot functionality will mean that we won't even run the bots if a MR hasn't changed, but what is the main cost of running the bots on a MR that hasn't changed? My guess would be that the expensive part are all the requests that the bots have to make in order to figure out that they don't have to do anything, but if we cache more aggressively, would that cost go away? Anyways, just a thought

Now for a bigger experiment 🧑‍🔬 One annoying thing is that the GitLab REST API doesn't return the comments for a MR when fetching a single MR (unlike GitHub which does). So I played around a bit with the GitLab GraphQL API and think I managed to achieve this:

$ GRAPHQL_TOKEN=<a personal access token with read-api scope>
$ curl -v 'https://<gitlab-domain>/api/graphql' \
     --header "Authorization: Bearer $GRAPHQL_TOKEN" \
     --header "Content-Type: application/json"
     --request POST \
     --data '{"query": "{ project(fullPath: \"<group-name>/<project-name>\") { mergeRequests(state: opened) { nodes { title notes { nodes { id body  } } } }  } }" }'

GraphQL queries can a look a bit messy when typing the on the command-line for CURL, below is the query a bit nicer formatted:

{
  "query": 
  "{ project(fullPath: \"<group-name>/<project-name>\") {
        mergeRequests(state: opened) {
          nodes {
            title
              notes {
                nodes { 
                  id 
                  body
                }
            }
          }
        }
      }
   }"
}

The query would have to be expanded to get include all the fields we use from the two GET calls we do today fetch the main MR information and the MR "notes" (comments). I think that can be done and if it can be, then GitLabRepository::pullRequests could be updated to utilise this (a new constructor for GitLabMergeRequest would also be needed to store the notes, that ties into my previous suggestion of caching the notes).

With the above, and with your great work on the PullRequestPoller (and drilling down into why GitLab doesn't honor the "updatedAfter" query parameter) I think we would get away with doing a single POST to GitLab's GraphQL endpoint when checking for updated MRs (since the notes would be included). So if nothing has changed, then all that is going to happen is a single HTTP request, and we can't optimise that away 😆

Feel free to ping me offline if you want to chat about GraphQL and these suggestions. Anyhow, this PR is good to go as it is, but there is also room to make further improvements 👍

Oct 07 '22 09:10 edvbld

@erikj79 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

:mag: One or more changes in this pull request modifies files in areas of the source code that often require two reviewers. Please consider if this is the case for this pull request, and if so, await a second reviewer to approve this pull request before you integrate it.

After integration, the commit message for the final commit will be:

1606: PullRequestPoller always processes the last updated MR for GitLab

Reviewed-by: ehelin

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 11 new commits pushed to the master branch:

b7b4d114a01cb703f35450ffe4820ce047fd235b: 1616: Improve error message when an author isn't found in GitLab
32a07238fde7773d64f0330e76d77f8d5d44f82a: 1603: Make labels handling consistent in all Issue implementations
ae79322ae8dd6ce91e00d46f84139020e3b55f90: 1563: The method 'CheckRun#updateMergeReadyComment' shouldn't update the comment if the comment has not changed
522d894fded2cf99b88d7873de5b69faed161cd4: 1215: Clean label not updating correctly
b061577e3ed45b464e73b865d8b73528b9166e07: 1612: Temp workaround GitLab returning 500 when modifying labels
a6860bcba94d93a2be7429ad7a8766755a925581: 1364: Jira Issues should be resolved as "Fixed"
773ec06c2a813f03d4341cf4ae36f9c5ae12fd70: 327: git webrev: Print the location of the generated webrev upon successful completion
7cf47ca037f9157195d330d746ffd7164edffea2: 1512: MirrorBot may get stuck failing to clone repository
78cab78418e8282ea7b08eb5e41395a481898d59: 1562: The method 'CheckRun#findComment' don't need the comment list as parameter
0a4aa299f3d5702f32e850d944694f15015ef086: 1235: git sync --branches doesn't work as expected
... and 1 more: https://git.openjdk.org/skara/compare/a1f48dd18c38b49aa9bb53b3a93d42b833fdf5f0...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

Oct 07 '22 09:10 openjdk[bot]

First of all, I think your changes are fine, the reasoning and the code checks out. I have a bunch of suggestions, but those you can choose whether you want to pursue or not, they do not have to be made within the scope of this PR. Smaller things to consider:

I also have a lot of ideas for how to further optimize. I'm trying to work iteratively on this so that each change can be easily proven correct (to make reviews easier and to hopefully avoid some mistakes). I also want to evaluate the metrics I have added between changes to make sure we attack the right things.

in GitLabMergeRequest, since you are already caching the comparisonSnapshot, should we instead cache the notes JSON result? There are several methods in GitLabMergeRequest that makes use of the notes JSON, then they would be able to utilise this cached result as well (saving a number of requests). This thought can of course be extended to other JSON objects that GitLabMergeRequest fetches as well.

Maybe, but I'm not sure if every fetch of a GitLabMergeRequest will eventually result in fetching notes. I thin I have been leaning towards a lazy fetch of notes instead. That would still be two queries, or rather an extra query per MR, but still a lot less than we have today.

very minor nit, but I would have named comparisonSnapshot just snapshot, but that is up to you 😄

I picked the name to make it clear that the intended purpose of this snapshot object is only for comparisons. Different implementations will include different sets of data, so no guarantees are given on anything besides that. Not sure I succeeded with that name though, so may change as suggested.

more of a question than a comment: the snapshot functionality will mean that we won't even run the bots if a MR hasn't changed, but what is the main cost of running the bots on a MR that hasn't changed? My guess would be that the expensive part are all the requests that the bots have to make in order to figure out that they don't have to do anything, but if we cache more aggressively, would that cost go away? Anyways, just a thought

There are plenty of other things we do in most WorkItems. We often instantiate a local workspace based on the PR source branch. This is quite expensive (5-10s in some repos). We instantiate some kind of Census instance, also in the order of seconds. In the mailinglist bridge, instantiating the marks workspace takes over 1 minute(!) but that is something I plan to address separately.

Now for a bigger experiment 🧑‍🔬 One annoying thing is that the GitLab REST API doesn't return the comments for a MR when fetching a single MR (unlike GitHub which does). So I played around a bit with the GitLab GraphQL API and think I managed to achieve this:

Playing around with GraphQL could certainly help reducing the number of queries needed. I haven't tried that yet.

Note though that GitLab does not include comments in the initial json. There is just no need to include comments in the comparisonSnapshot for GitLabPullRequest, because the updated_at field is always updated by comment changes.

Oct 07 '22 17:10 erikj79

/integrate

Oct 07 '22 18:10 erikj79

Going to push as commit 4b40646da7589e3dd88e34d00112bddc3dd1428b. Since your change was applied there have been 11 commits pushed to the master branch:

b7b4d114a01cb703f35450ffe4820ce047fd235b: 1616: Improve error message when an author isn't found in GitLab
32a07238fde7773d64f0330e76d77f8d5d44f82a: 1603: Make labels handling consistent in all Issue implementations
ae79322ae8dd6ce91e00d46f84139020e3b55f90: 1563: The method 'CheckRun#updateMergeReadyComment' shouldn't update the comment if the comment has not changed
522d894fded2cf99b88d7873de5b69faed161cd4: 1215: Clean label not updating correctly
b061577e3ed45b464e73b865d8b73528b9166e07: 1612: Temp workaround GitLab returning 500 when modifying labels
a6860bcba94d93a2be7429ad7a8766755a925581: 1364: Jira Issues should be resolved as "Fixed"
773ec06c2a813f03d4341cf4ae36f9c5ae12fd70: 327: git webrev: Print the location of the generated webrev upon successful completion
7cf47ca037f9157195d330d746ffd7164edffea2: 1512: MirrorBot may get stuck failing to clone repository
78cab78418e8282ea7b08eb5e41395a481898d59: 1562: The method 'CheckRun#findComment' don't need the comment list as parameter
0a4aa299f3d5702f32e850d944694f15015ef086: 1235: git sync --branches doesn't work as expected
... and 1 more: https://git.openjdk.org/skara/compare/a1f48dd18c38b49aa9bb53b3a93d42b833fdf5f0...master

Your commit was automatically rebased without conflicts.

Oct 07 '22 18:10 openjdk[bot]

@erikj79 Pushed as commit 4b40646da7589e3dd88e34d00112bddc3dd1428b.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.