
Add priority queuing to optimize implicit nodes

Open dmcilvaney opened this issue 2 years ago • 1 comment

Merge Checklist

All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)

  • [ ] The toolchain has been rebuilt successfully (or no changes were made to it)
  • [ ] The toolchain/worker package manifests are up-to-date
  • [ ] Any updated packages successfully build (or no packages were changed)
  • [ ] Packages depending on static components modified in this PR (Golang, *-static subpackages, etc.) have had their Release tag incremented.
  • [ ] Package tests (%check section) have been verified with RUN_CHECK=y for existing SPEC files, or added to new SPEC files
  • [ ] All package sources are available
  • [ ] cgmanifest files are up-to-date and sorted (./cgmanifest.json, ./toolkit/scripts/toolchain/cgmanifest.json, .github/workflows/cgmanifest.json)
  • [ ] LICENSE-MAP files are up-to-date (./SPECS/LICENSES-AND-NOTICES/data/licenses.json, ./SPECS/LICENSES-AND-NOTICES/LICENSES-MAP.md, ./SPECS/LICENSES-AND-NOTICES/LICENSE-EXCEPTIONS.PHOTON)
  • [ ] All source files have up-to-date hashes in the *.signatures.json files
  • [ ] sudo make go-tidy-all and sudo make go-test-coverage pass
  • [ ] Documentation has been updated to match any changes to the build system
  • [ ] Ready to merge

Summary

Add a priority-based build queue so that implicit provides are resolved without wasted work.

The goal is to avoid building packages that aren't required for a given goal. Currently, if the subgraph rooted at our goal (say, the list of all packages that need to be built for an image config) contains a dynamic dependency (one that looks like something(name) or /path/to/file/), the scheduler can't optimize the build.

The scheduler will attempt to take the full graph, which contains all packages defined locally, and prune it down to just the required subgraph. It does this every time it gets a build result back, until the graph is successfully optimized. The optimization will fail if we have a dynamic/implicit dependency: we can't optimize since we don't know which package will actually end up supplying that dependency. Packages generally won't have an explicit Provides: /path/to/file in the .spec, which would allow the graph to encode that information. Instead, we have to wait until a package build is complete and scan the resulting .rpm to see if it provides anything extra beyond what the .spec claims.

In the current version, if the graph cannot be optimized, the scheduler will effectively take a random walk through the graph, operating on each leaf node it finds. When a package is built it is "removed" from the graph (actually just marked as done), and new "leaf" nodes may be exposed. Each time a package is built, the rpms are scanned and the graph is updated to remove any implicit nodes that are now satisfied. Once all implicit nodes are removed from the subgraph under the goal, the graph is pruned of all non-essential nodes and the build is optimized.
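
A minimal sketch of that loop, assuming hypothetical simplified types (node, graph, resolveImplicit, canOptimize are invented names, not the toolkit's actual scheduler API):

```go
package main

import "fmt"

// Hypothetical, simplified types; the real toolkit's graph package differs.
type node struct {
	name     string
	implicit bool // dynamic dependency such as "something(name)" or a file path
	done     bool
}

type graph struct {
	nodes []*node
}

// resolveImplicit marks any implicit nodes satisfied by the provides list
// scanned from a freshly built .rpm.
func (g *graph) resolveImplicit(provides []string) {
	for _, n := range g.nodes {
		if !n.implicit || n.done {
			continue
		}
		for _, p := range provides {
			if n.name == p {
				n.done = true
			}
		}
	}
}

// canOptimize reports whether every implicit node is resolved; only then can
// the graph be pruned down to the required subgraph.
func (g *graph) canOptimize() bool {
	for _, n := range g.nodes {
		if n.implicit && !n.done {
			return false
		}
	}
	return true
}

func main() {
	g := &graph{nodes: []*node{
		{name: "/path/to/my-pkg/some-file", implicit: true},
	}}
	// A build result comes back: scan the .rpm's provides, then retry the
	// optimization pass.
	g.resolveImplicit([]string{"/path/to/my-pkg/some-file"})
	fmt.Println("optimizable:", g.canOptimize())
}
```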

Instead, we should try to prioritize reaching the nodes that provide the implicit dependencies. As above, it isn't possible to know with 100% confidence that a given package will still provide a given implicit dependency, but there is some information that can provide a decent guess.

When the package fetcher runs to populate the graph, it assumes that each implicit dependency may not be found in the local packages and queries the repo for a provider (99% of the time we get the same package back as we have locally, plus or minus some version differences). It places this package into the cache as a backup in case the local packages can't provide the dependency. The scheduler has a global mode that tries to build as much as possible without using these cached copies; only if the scheduler gets completely stuck will it enable the cached implicit nodes.
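
A rough sketch of that fetcher behavior, with invented names (cacheRemoteProviders, repoQuery) standing in for the real fetcher and repo client:

```go
package main

import "fmt"

type remotePkg struct {
	Name, Version string
}

// cacheRemoteProviders queries the remote repo for a provider of each
// implicit dependency and stashes it as a backup. The cached copies are used
// only if the scheduler gets completely stuck building from local specs.
func cacheRemoteProviders(deps []string, repoQuery func(string) (remotePkg, bool)) map[string]remotePkg {
	cache := make(map[string]remotePkg)
	for _, dep := range deps {
		if p, ok := repoQuery(dep); ok {
			cache[dep] = p
		}
	}
	return cache
}

func main() {
	query := func(dep string) (remotePkg, bool) {
		// Stand-in for a real repo query (e.g. against PMC).
		return remotePkg{Name: "my-pkg", Version: "1.1.1-1"}, true
	}
	cache := cacheRemoteProviders([]string{"/path/to/my-pkg/some-file"}, query)
	fmt.Println(cache)
}
```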

We can use this information to make an educated guess about which nodes will end up providing the implicit dependency. Say we have my-pkg-1.1.1-1 published on PMC, and my-pkg-1.1.1-2 as a .spec file locally. We have another package goal-pkg-1-1 that has the line BuildRequires: /path/to/my-pkg/some-file. If we look at the %files section of my-pkg we might see:

%files
/path/to/%{name}/*

Clearly we won't know what files are available in /path/to/my-pkg until after we build the .rpm... but if we look at the cache we will see an implicit remote node pointing to my-pkg-1.1.1-1 from PMC. So, if we assume nothing has changed, we can look at the remote node and guess that, since we can get that file from my-pkg-1.1.1-1, we can probably expect it from 1.1.1-2 as well.
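
That guess could be expressed roughly like this; guessLocalProvider and both maps are hypothetical names, just to illustrate the lookup:

```go
package main

import "fmt"

// guessLocalProvider maps an implicit dependency to a local spec by assuming
// the package that provides it remotely (per the fetcher's cache) still
// provides it in the local, newer revision.
func guessLocalProvider(dep string, remoteCache map[string]string, localSpecs map[string]bool) (string, bool) {
	pkg, ok := remoteCache[dep] // e.g. "/path/to/my-pkg/some-file" -> "my-pkg"
	if !ok {
		return "", false
	}
	if localSpecs[pkg] {
		return pkg, true // guess: local my-pkg-1.1.1-2 provides it too
	}
	return "", false
}

func main() {
	remoteCache := map[string]string{"/path/to/my-pkg/some-file": "my-pkg"}
	localSpecs := map[string]bool{"my-pkg": true, "goal-pkg": true}
	if pkg, ok := guessLocalProvider("/path/to/my-pkg/some-file", remoteCache, localSpecs); ok {
		fmt.Println("prioritize building:", pkg)
	}
}
```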

To this end, we can have three levels of priority for the scheduler:

  • High: put my-pkg here. Check it as fast as possible so we can see whether the implicit dependency is resolved. If it is, we can optimize the graph.
  • Medium: put goal-pkg here. We know anything else under this subgraph will be required for the end goal and we always want to build it.
  • Low: everything else. Ideally we don't want to build anything here, but if the implicit dependencies aren't found based on our guess, we need to fall back to the old mechanism (random builds); see the sketch after this list.
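
A minimal sketch of how these levels might be assigned, assuming hypothetical names (buildPriority, prioritize, guessedProviders) rather than the toolkit's actual scheduler types:

```go
package main

import "fmt"

// Hypothetical priority levels; names are illustrative, not the toolkit's.
type buildPriority int

const (
	priorityLow    buildPriority = iota // everything else; build only as a fallback
	priorityMedium                      // nodes under the goal subgraph; always needed
	priorityHigh                        // guessed providers of implicit dependencies
)

// prioritize assigns a level based on whether a package is a guessed
// implicit-dependency provider or sits under the goal subgraph.
func prioritize(pkg string, guessedProviders, goalSubgraph map[string]bool) buildPriority {
	switch {
	case guessedProviders[pkg]:
		return priorityHigh
	case goalSubgraph[pkg]:
		return priorityMedium
	default:
		return priorityLow
	}
}

func main() {
	guessed := map[string]bool{"my-pkg": true}
	goal := map[string]bool{"my-pkg": true, "goal-pkg": true}
	for _, p := range []string{"my-pkg", "goal-pkg", "unrelated-pkg"} {
		// Prints the numeric level: 2 (high), 1 (medium), 0 (low).
		fmt.Println(p, "->", prioritize(p, guessed, goal))
	}
}
```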

An added complexity: the scheduler currently queues all packages into the build channels in a greedy fashion. We need a mechanism to ignore the low-priority channels, or avoid queuing low-priority packages, until we get stuck and fail to optimize.
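
One possible gating mechanism, again a sketch with invented names (nextBuild, the stuck flag): drain the high and medium channels non-blockingly, and only consult the low channel once optimization has failed:

```go
package main

import "fmt"

// nextBuild picks the next package to hand to a worker. High drains before
// medium, and the low-priority channel is only touched once the scheduler has
// declared itself stuck (i.e. it failed to optimize the graph).
func nextBuild(high, medium, low chan string, stuck bool) (string, bool) {
	select {
	case pkg := <-high:
		return pkg, true
	default:
	}
	select {
	case pkg := <-medium:
		return pkg, true
	default:
	}
	if stuck {
		select {
		case pkg := <-low:
			return pkg, true
		default:
		}
	}
	return "", false // nothing eligible to build right now
}

func main() {
	high := make(chan string, 1)
	medium := make(chan string, 1)
	low := make(chan string, 1)
	medium <- "goal-pkg"
	low <- "unrelated-pkg"
	pkg, ok := nextBuild(high, medium, low, false)
	fmt.Println(pkg, ok) // goal-pkg true; unrelated-pkg stays queued until stuck
}
```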

Some test results:

# Prep build to remove variability from network
sudo make clean-build-packages && sudo make graph-preprocessed input-srpms go-tools PRECACHE=y CONFIG_FILE=./imageconfigs/core-efi.json
# Build packages needed for core-efi.json
time sudo make build-packages PRECACHE=y CONFIG_FILE=./imageconfigs/core-efi.json 2>&1 | tee log.txt
| Mode | Wall Time    | Total packages built | Wasted builds |
|------|--------------|----------------------|---------------|
| New  | 57m (1hr)    | 109                  | 0             |
| Old  | 144m (2.3hr) | 477                  | 368           |

tldr: We wasted over an hour building packages we ended up not using in the old flow.

Change Log
  • Change
  • Change
  • Change
Does this affect the toolchain?

YES/NO

Associated issues
  • #xxxx
Links to CVEs
  • https://nvd.nist.gov/vuln/detail/CVE-YYYY-XXXX
Test Methodology
  • Pipeline build id: xxxx

dmcilvaney · Oct 09 '23 21:10

This PR is now re-targeted at 3.0-dev.

dmcilvaney · Feb 21 '24 00:02