monobuild icon indicating copy to clipboard operation
monobuild copied to clipboard

Dependency-aware cache key (to allow memoization of build steps)

Open StuartHarris opened this issue 6 years ago • 3 comments

Feature request: a monobuild command that calculates a hash for a component (or list of components). This would facilitate the ability to memoize build steps (e.g. when using an artefact cache). Each hash would be calculated as a Merkle hash of itself and all of its dependencies' hashes.

StuartHarris avatar May 18 '18 10:05 StuartHarris

Some thoughts on this, these are more my notes so that I don't forget the thinking:

  • It's much easier to do this on git's Merkle tree (i.e. a committed revision in git) than on the working directory. (And it's trivial to commit a working directory, even temporarily).
  • Calculating the cache key is equivalent to filtering the repo down to only the contents of the directories of the given component and its dependencies, nothing else contributes to its cache id.

That means that in order to construct a cache key of a given component (and its dependencies), we can:

  1. Find all (strong and weak) dependencies of the component - this will produce a list of paths, e.g.
    app1
    libs/lib1
    libs/lib3
    
    These are path prefixes we want to keep
  2. Recursively prune the full git tree to only contain files which match the prefix, i.e.
    • Turn the list of paths into a tree structure (using the / file delimiter) - the "filter tree", e.g. for the above:
      app1
      libs
        lib1
        lib3
      
    • Walk the filter tree and the git file tree at the same time, until we reach the leaves of the filter tree
    • On the way up, format trees (using mk-tree or equivalent) that only contain the paths in the filter tree, discard everything else
    • At the top, we end up with a new tree file
  3. The new root tree oid (sha) is our cache key.

When any of the leaves (i.e. components) change, their tree file will be different and therefore the resulting root tree will be different, resulting in a different cache id.

If components are nested in one another, it forms implied weak dependencies (changing a sub-component changes the tree of its parent component as well) so everything still works as expected.

The nice thing is this scales with the number of dependencies of a component, not the size of the repo, so it should be pretty fast even on huge monorepos, unless the dependency graph is massive (at which point you've got a bigger problem to worry about).

It's not particularly hard to implement this, but I wonder whether there are git plumbing commands that would let us just do it.

charypar avatar May 21 '18 21:05 charypar

This looks very useful http://alx.github.io/gitbook/7_raw_git.html

charypar avatar May 21 '18 21:05 charypar

Sounds like Nix to me :) https://nixos.org/nix/

lpil avatar Dec 29 '18 15:12 lpil