When using a monorepo, react only to changes in configured paths
Description
When using a monorepo (i.e. multiple packages in one git repository, defined using the `packages` key) with many packages, triggering actions for all of the packages is not needed.
Let's check the affected paths first and react only if anything has changed under a package-related path.
Benefit
- Saves resources. (Only the important/relevant jobs are run.)
- Cleaner UX. (Only the important/relevant results are shown.)
Importance
This has been requested multiple times, most recently by @LecrisUT, who wants to use a monorepo with ~500 packages.
What is the impacted category (job)?
General
Workaround
- [ ] There is an existing workaround that can be used until this feature is implemented.
Participation
- [ ] I am willing to submit a pull request for this issue. (Packit team is happy to help!)
I am preparing some tips/ideas for the implementation, but I am now not sure whether this should be pull-request/merge-request specific. WDYT? Should we run only the change-relevant jobs for pull requests but all jobs for branch/release builds, so we can be sure everything is rebuilt once merged? Or is that misleading/confusing, since the new builds not run on the PR/MR could potentially fail?
Option 1:
Add the info to the Event classes and use this when creating job tasks:
- [ ] Add a new property to all the relevant event classes: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/events/github.py#L86
- [ ] Fill in the info when parsing:
https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/parser.py#L250
- [ ] Check if we can actually get that info: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads?actionType=opened#pull_request (If I've checked correctly, it's not in the event payload and we need to use the GitHub API (via this OGR method in our case) to get that info.)
- [ ] Use the info when filtering the new tasks to be added to our job queue: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/jobs.py#L446
I would say to have it for both PRs and branch commits initially. Later, this should be handled by the dependencies logic with the rebuild/sync-version logic from here. To bridge the gap between them, maybe the user could run the Packit build in Copr for the selected packages, or use a `/packit build` comment in the PR specifying the packages there (later just `/packit build dependents`).
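To make Option 1 more concrete, here is a minimal Python sketch. All the names here (`PullRequestEvent`, `package_is_affected`, the injected fetcher) are illustrative, not the real packit-service classes: the idea is that the event object lazily fetches and caches the changed files, since the webhook payload itself does not carry them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in for a packit-service event class (Option 1):
# the list of changed files lives on the event itself.
@dataclass
class PullRequestEvent:
    pr_id: int
    # Injected callable standing in for the GitHub API / OGR call,
    # since the webhook payload does not contain the file list.
    _fetch_changed_files: Callable[[int], List[str]] = None
    _changed_files: Optional[List[str]] = None

    @property
    def changed_files(self) -> List[str]:
        # Fetch lazily and cache, so we only hit the API when a job
        # actually needs the information.
        if self._changed_files is None:
            self._changed_files = self._fetch_changed_files(self.pr_id)
        return self._changed_files

def package_is_affected(package_paths: List[str], changed_files: List[str]) -> bool:
    """True if any changed file lives under one of the package's paths."""
    return any(
        f == p or f.startswith(p.rstrip("/") + "/")
        for p in package_paths
        for f in changed_files
    )
```

The task-creation code would then call something like `package_is_affected(package.paths, event.changed_files)` before enqueueing a job for that package.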
Option 2:
Probably simpler. Check this info when creating the job tasks:
- [ ] Use the GitHub API (via this OGR method in our case) to get the info about changed files.
- [ ] And use it when filtering the new tasks to be added to our job queue. Probably here: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/jobs.py#L446
I would say to have it for both PRs and branch commits initially.
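A sketch of what the Option 2 filtering step could look like, under assumptions: the job shape is a simplified stand-in for the real packit-service job config, and I'm assuming the `paths` values are glob patterns relative to the repo root.

```python
from fnmatch import fnmatch
from typing import Dict, List

# Hypothetical Option 2 filter: applied right where job tasks are
# created, using the changed-file list obtained from the forge API.
def filter_jobs_by_changed_paths(
    jobs: List[Dict],        # each job: {"package": str, "paths": [glob, ...]}
    changed_files: List[str],
) -> List[Dict]:
    def touches(job: Dict) -> bool:
        # Keep a job in the queue only if some changed file matches
        # one of its configured path globs (e.g. "packages/foo/*").
        return any(
            fnmatch(changed, pattern)
            for pattern in job["paths"]
            for changed in changed_files
        )
    return [job for job in jobs if touches(job)]
```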
:+1: That makes sense.
> I am preparing some tips/ideas for the implementation, but I am now not sure whether this should be pull-request/merge-request specific. WDYT? Should we run only the change-relevant jobs for pull requests but all jobs for branch/release builds, so we can be sure everything is rebuilt once merged? Or is that misleading/confusing, since the new builds not run on the PR/MR could potentially fail?
I think the triggering method should stay the same: we should always create all the jobs and let the jobs themselves decide whether they are relevant or not, but maybe I am missing something...
> Option 1:
> Add the info to the Event classes and use this when creating job tasks:
> - [ ] Add a new property to all the relevant event classes: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/events/github.py#L86
> - [ ] Fill in the info when parsing: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/parser.py#L250
> - [ ] Check if we can actually get that info: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads?actionType=opened#pull_request (If I've checked correctly, it's not in the event payload and we need to use the GitHub API (via this OGR method in our case) to get that info.)
> - [ ] Use the info when filtering the new tasks to be added to our job queue: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/jobs.py#L446
I know that with these solutions we can save some CPU time, but to me this seems like work that should be done by the job itself. I don't think the jobs will always have the same filtering method; most of them will just look for something that happened inside their path, but others, which are dependent, could also check for changes elsewhere. And maybe there are jobs that need to run always to keep things consistent (I don't really know, just a fear).
> I think the triggering method should stay the same: we should always create all the jobs and let the jobs themselves decide whether they are relevant or not, but maybe I am missing something...
So, do it on the handler level (where)? Or as a handler pre-check?
I think this can be a pre-check. At first, all the jobs can share the same pre-check code, which checks whether something happened inside the package's paths; later it could easily be customized or removed per job's handler.
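A rough sketch of that pre-check idea, assuming hypothetical class and method names (the real packit-service handler interface will differ): the default check looks for changes under the package's paths, and a handler that must always run simply overrides it.

```python
from typing import List

# Hypothetical default pre-check shared by all job handlers.
class PackagePathsChangedPreCheck:
    def __init__(self, package_paths: List[str], changed_files: List[str]):
        # Normalize paths so "pkg-a" and "pkg-a/" behave the same.
        self.package_paths = [p.rstrip("/") + "/" for p in package_paths]
        self.changed_files = changed_files

    def pre_check(self) -> bool:
        # Default behaviour: run the job only when something changed
        # under one of the package's configured paths.
        return any(
            f.startswith(p)
            for p in self.package_paths
            for f in self.changed_files
        )

# Example of per-handler customization: a job that must always run
# (e.g. to keep things consistent) just overrides the check.
class AlwaysRunPreCheck(PackagePathsChangedPreCheck):
    def pre_check(self) -> bool:
        return True
```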
To detect the changed files, how about running purely on the git repo and CI status:
- On PRs: check against the target branch
- On branches: check against the last Packit job run
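The git-only detection above can be sketched like this (a minimal Python sketch, not packit-service code; the function name is made up). The three-dot `target...head` form diffs against the merge base, which matches what a PR diff shows; for branch pushes, `target` would be the commit of the last Packit run instead of a branch name.

```python
import subprocess
from typing import List

def changed_files_from_git(repo_dir: str, target: str, head: str = "HEAD") -> List[str]:
    """Return files changed between the merge base of target/head and head."""
    out = subprocess.run(
        # "target...head" = diff from merge-base(target, head) to head,
        # i.e. only the changes introduced on the head side.
        ["git", "diff", "--name-only", f"{target}...{head}"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in out.stdout.splitlines() if line]
```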
This would be very useful to implement - if you look at https://github.com/fedora-copr/copr/pull/2720, for us it would spawn circa 70 builds for every PR with every change. This is a bit of overkill considering that we need to build only a fraction of them.
> Should we run only the change-relevant jobs for pull requests but all jobs for branch/release builds, so we can be sure everything is rebuilt once merged?
Good question. For Copr specifically, we don't need to build everything once a PR is merged, and I believe that many users may have the same use case, so it may be useful to make building everything an opt-in feature for the folks who want it.
Regarding the two options, the first one sounds too complicated to me, but I don't know the implementation details. My first thought on how to implement it was a naive approach: using `git diff` directly between the main branch and the PR branch and then matching the subdirectories against the content mentioned in `.packit.yaml` (the `paths` key). And this sounds similar to the second option (use OGR to get the changed files and then process them). Is there some implementation detail that stops you from doing it this way?
We can start with the PR diff and get the info from GitHub. If I remember correctly, the affected files are available via the API, so we can skip the job even before doing anything with the cloned git repo.
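For reference, the GitHub REST API exposes this via `GET /repos/{owner}/{repo}/pulls/{number}/files` (paginated, 30 files per page by default, up to 100 with `per_page`). A small stdlib-only sketch, with made-up function names (in packit-service this would go through OGR rather than raw HTTP):

```python
import json
import urllib.request
from typing import List, Optional

def parse_changed_filenames(files_payload: list) -> List[str]:
    # Each entry returned by the /pulls/{n}/files endpoint carries a
    # "filename" key with the path of the changed file.
    return [entry["filename"] for entry in files_payload]

def changed_files_for_pr(
    owner: str, repo: str, pr_number: int, token: Optional[str] = None
) -> List[str]:
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/pulls/{pr_number}/files?per_page=100"
    )
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_changed_filenames(json.load(resp))
```

Note: a single page only; PRs with more than 100 changed files would need to follow the pagination `Link` headers.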
Any ETA on this? This blocks Copr from using Packit in PRs.
Hi @nikromen, no, we haven't discussed this for a while, but I may bring it up at our next team meeting and let you know.