When using a monorepo, react only to changes in configured paths
Description
When using a monorepo (i.e. multiple packages in one git repository, defined using the `packages` key) with many packages, triggering actions for all of the packages is not needed.
Let's check the affected paths first and react only if anything has changed under a package-related path.
Benefit
- Saves resources. (Only the important/relevant jobs are run.)
- Cleaner UX. (Only the important/relevant results are shown.)
Importance
This has been requested multiple times, most recently by @LecrisUT, who wants to use a monorepo with ~500 packages.
What is the impacted category (job)?
General
Workaround
- [ ] There is an existing workaround that can be used until this feature is implemented.
Participation
- [ ] I am willing to submit a pull request for this issue. (Packit team is happy to help!)
I am preparing some tips/ideas for the implementation, but I am now not sure whether this should be pull-request/merge-request specific. WDYT? Should we run only the change-relevant jobs for pull requests but all jobs for branch/release builds, so we can be sure everything is rebuilt once merged? Or is that misleading/confusing, since the new builds not run on the PR/MR could potentially fail?
Option 1:
Add the info to the Event classes and use this when creating job tasks:
- [ ] Add a new property to all the relevant event classes: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/events/github.py#L86
- [ ] Fill in the info when parsing:
https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/parser.py#L250
- [ ] Check if we can actually get that info: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads?actionType=opened#pull_request (If I've checked correctly, it's not in the event payload and we need to use the GitHub API (via this OGR method in our case) to get that info.)
- [ ] Use the info when filtering the new tasks to be added to our job queue: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/jobs.py#L446
I would say to have it for both PRs and branch commits initially. Later, this should be handled by the dependencies logic with the rebuild/sync-version logic from here. To bridge the gap between them, maybe the user could run the Packit build in Copr for the selected packages, or use a `/packit build` comment in the PR specifying the packages there (later just `/packit build dependents`).
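To make Option 1 more concrete, here is a minimal Python sketch. All the names here (`PullRequestEvent`, `package_is_affected`, the injected fetcher) are illustrative, not the real packit-service classes: the idea is that the event object lazily fetches and caches the changed files, since the webhook payload itself does not carry them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in for a packit-service event class (Option 1):
# the list of changed files lives on the event itself.
@dataclass
class PullRequestEvent:
    pr_id: int
    # Injected callable standing in for the GitHub API / OGR call,
    # since the webhook payload does not contain the file list.
    _fetch_changed_files: Callable[[int], List[str]] = None
    _changed_files: Optional[List[str]] = None

    @property
    def changed_files(self) -> List[str]:
        # Fetch lazily and cache, so we only hit the API when a job
        # actually needs the information.
        if self._changed_files is None:
            self._changed_files = self._fetch_changed_files(self.pr_id)
        return self._changed_files

def package_is_affected(package_paths: List[str], changed_files: List[str]) -> bool:
    """True if any changed file lives under one of the package's paths."""
    return any(
        f == p or f.startswith(p.rstrip("/") + "/")
        for p in package_paths
        for f in changed_files
    )
```

The task-creation code would then call something like `package_is_affected(package.paths, event.changed_files)` before enqueueing a job for that package.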
Option 2:
Probably simpler. Check this info when creating the job tasks:
- [ ] Use the GitHub API (via this OGR method in our case) to get the info about changed files.
- [ ] And use it when filtering the new tasks to be added to our job queue. Probably here: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/jobs.py#L446
I would say to have it for both PRs and branch commits initially.
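A sketch of what the Option 2 filtering step could look like, under assumptions: the job shape is a simplified stand-in for the real packit-service job config, and I'm assuming the `paths` values are glob patterns relative to the repo root.

```python
from fnmatch import fnmatch
from typing import Dict, List

# Hypothetical Option 2 filter: applied right where job tasks are
# created, using the changed-file list obtained from the forge API.
def filter_jobs_by_changed_paths(
    jobs: List[Dict],        # each job: {"package": str, "paths": [glob, ...]}
    changed_files: List[str],
) -> List[Dict]:
    def touches(job: Dict) -> bool:
        # Keep a job in the queue only if some changed file matches
        # one of its configured path globs (e.g. "packages/foo/*").
        return any(
            fnmatch(changed, pattern)
            for pattern in job["paths"]
            for changed in changed_files
        )
    return [job for job in jobs if touches(job)]
```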
:+1: That makes sense.
> I am preparing some tips/ideas for the implementation, but I am now not sure whether this should be pull-request/merge-request specific. WDYT? Should we run only the change-relevant jobs for pull requests but all jobs for branch/release builds, so we can be sure everything is rebuilt once merged? Or is that misleading/confusing, since the new builds not run on the PR/MR could potentially fail?
I think the triggering method should stay the same: we should always create all the jobs and let the jobs themselves decide whether they are relevant or not, but maybe I am missing something...
> Option 1:
> Add the info to the Event classes and use this when creating job tasks:
> - [ ] Add a new property to all the relevant event classes: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/events/github.py#L86
> - [ ] Fill in the info when parsing: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/parser.py#L250
> - [ ] Check if we can actually get that info: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads?actionType=opened#pull_request (If I've checked correctly, it's not in the event payload and we need to use the GitHub API (via this OGR method in our case) to get that info.)
> - [ ] Use the info when filtering the new tasks to be added to our job queue: https://github.com/packit/packit-service/blob/a8dddb3d4be8ca29110f93e671a54a8a3fd6be7a/packit_service/worker/jobs.py#L446
I know that with these solutions we can save some CPU time, but to me this seems like work that should be done by the job itself. I don't think the jobs will always have the same filtering method; most of them will just look for something that happened inside their path, but others, which are dependent, could also check for changes elsewhere. And maybe there are jobs that need to run always to keep things consistent (I don't really know, just a fear).
> I think the triggering method should stay the same: we should always create all the jobs and let the jobs themselves decide whether they are relevant or not, but maybe I am missing something...
So, do it on the handler level (where)? Or as a handler pre-check?
I think this can be a pre-check. At first, all the jobs can share the same pre-check code, which checks whether something happened inside the package's paths; later it could easily be customized or removed per job's handler.
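A rough sketch of that pre-check idea, assuming hypothetical class and method names (the real packit-service handler interface will differ): the default check looks for changes under the package's paths, and a handler that must always run simply overrides it.

```python
from typing import List

# Hypothetical default pre-check shared by all job handlers.
class PackagePathsChangedPreCheck:
    def __init__(self, package_paths: List[str], changed_files: List[str]):
        # Normalize paths so "pkg-a" and "pkg-a/" behave the same.
        self.package_paths = [p.rstrip("/") + "/" for p in package_paths]
        self.changed_files = changed_files

    def pre_check(self) -> bool:
        # Default behaviour: run the job only when something changed
        # under one of the package's configured paths.
        return any(
            f.startswith(p)
            for p in self.package_paths
            for f in self.changed_files
        )

# Example of per-handler customization: a job that must always run
# (e.g. to keep things consistent) just overrides the check.
class AlwaysRunPreCheck(PackagePathsChangedPreCheck):
    def pre_check(self) -> bool:
        return True
```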
To detect the changed files, how about running purely on the git repo and CI status:
- On PRs: check against the target branch
- On branches: check against the last Packit job run
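The git-only detection above can be sketched like this (a minimal Python sketch, not packit-service code; the function name is made up). The three-dot `target...head` form diffs against the merge base, which matches what a PR diff shows; for branch pushes, `target` would be the commit of the last Packit run instead of a branch name.

```python
import subprocess
from typing import List

def changed_files_from_git(repo_dir: str, target: str, head: str = "HEAD") -> List[str]:
    """Return files changed between the merge base of target/head and head."""
    out = subprocess.run(
        # "target...head" = diff from merge-base(target, head) to head,
        # i.e. only the changes introduced on the head side.
        ["git", "diff", "--name-only", f"{target}...{head}"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in out.stdout.splitlines() if line]
```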
This would be very useful to implement - if you look at https://github.com/fedora-copr/copr/pull/2720, for us it would spawn circa 70 builds for every PR with every change. This is a bit of overkill considering that we need to build only a fraction of them.
> Should we run only the change-relevant jobs for pull requests but all jobs for branch/release builds, so we can be sure everything is rebuilt once merged?
Good question. For Copr specifically, we don't need to build everything once a PR is merged, and I believe that many users may have the same use case, so it may be useful to make building everything an opt-in feature for the folks who want it.
Regarding the two options, the first one sounds too complicated to me, but I don't know the implementation details. My first thought on how to implement it was a naive approach: using `git diff` directly between the main branch and the PR branch and then matching the subdirectories against the content mentioned in `.packit.yaml` (the `paths` key). And this sounds similar to the second option (use OGR to get the changed files and then process them). Is there some implementation detail that stops you from doing it this way?
We can start with the PR diff and get the info from GitHub. If I remember correctly, the affected files are available via the API, so we can skip the job even before doing anything with the cloned git repo.
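For reference, the GitHub REST API exposes this via `GET /repos/{owner}/{repo}/pulls/{number}/files` (paginated, 30 files per page by default, up to 100 with `per_page`). A small stdlib-only sketch, with made-up function names (in packit-service this would go through OGR rather than raw HTTP):

```python
import json
import urllib.request
from typing import List, Optional

def parse_changed_filenames(files_payload: list) -> List[str]:
    # Each entry returned by the /pulls/{n}/files endpoint carries a
    # "filename" key with the path of the changed file.
    return [entry["filename"] for entry in files_payload]

def changed_files_for_pr(
    owner: str, repo: str, pr_number: int, token: Optional[str] = None
) -> List[str]:
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/pulls/{pr_number}/files?per_page=100"
    )
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_changed_filenames(json.load(resp))
```

Note: a single page only; PRs with more than 100 changed files would need to follow the pagination `Link` headers.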
Any ETA on this? This blocks Copr from using Packit in PRs.
Hi @nikromen, no, we haven't discussed this for a while, but I may bring it up at our next team meeting and let you know.