Holmake source-based hashing for dependencies
Recursively compute source-based hashes for theory-target dependencies, for #1522.
A particular theory .dat file depends on its script file and the theory files it loads. An existing .dat file already records the hashes of its ancestor .dat files (which do the same with their ancestors). This much would seem to give us "recursive" hashing of the .dat file.
It seems like all we really need to do then is to also include the hash of the corresponding Script.sml file in the .dat file that it generated.
Actually do we even need that? Can we not reuse build outputs if the .dat files are the same even if the script files changed?
I think the problem is more in the lookup direction: if all you have is script files, how do you know what hash to check the cache for to avoid building?
Well, the script file hashes could be the keys that you looked up in the cache...
It seems a rebuild has to happen if
- the script file is newer than the .dat file, and
- the script's hash is not a key in the cache
Something is wrong (probably requiring a rebuild or a re-pull from the cache) if:
- the script file's hash is in the cache as a key, but
- the stored .dat file has hashes for ancestors that are not present
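To make those two conditions concrete, here is a rough decision sketch (Python; the cache structure and helpers are invented for illustration, not Holmake's actual formats):

```python
import hashlib, os

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def decide(script_path, dat_path, cache, ancestors_on_disk):
    """cache: script-file hash -> set of ancestor .dat hashes the stored
    build was made against; ancestors_on_disk: ancestor hashes we have
    locally. Both structures are hypothetical."""
    key = file_hash(script_path)
    script_newer = (not os.path.exists(dat_path)
                    or os.path.getmtime(script_path) > os.path.getmtime(dat_path))
    if script_newer and key not in cache:
        return "rebuild"
    if key in cache and not cache[key] <= ancestors_on_disk:
        return "rebuild-or-repull"   # stored .dat references absent ancestors
    return "up-to-date"
```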
Can we please please not reinvent the wheel here? This is not so different from building/compiling in a normal PL environment so there is probably existing work to consult.
I don't like the idea of the .dat files depending on irrelevant (to the generated theory) script file changes.
But I'm happy for the script file hash to be included somewhere for the purposes of this cache.
Regarding following existing practice maybe @digama0 or @pylasnier already knows the design of another build system we ought to copy?
Yeah, agreed. We shouldn't do what I suggested earlier and have script-file hashes in .dat files.
Given that the .dat files then contain all of the relevant hashes of their ancestors, and those ancestors do likewise, it seems like this particular sub-task is already done.
The original Nix system is the work of Eelco Dolstra as described in his PhD thesis. The original work is probably worth looking at for more detail, but I think I can abbreviate it for the purposes of HOL.
Nix uses a flat build store, so theories would get stored in a flat structure looking something like `hol-store/<hash>-<name>Theory.dat`. Only the hash would be used by HOL to retrieve a theory, and the name is just for human readability.
We would need to distinguish between hashes for source files and for build artifacts, so perhaps Script hashes and Theory hashes. Script hashes are simple hashes of the source Script files, no dependencies considered. Theory hashes need to include dependencies, so we would somehow hash the Script hash and all of the Theory hashes of its dependencies. Theory hashes are the ones which are used to store theories in the store.
Theory hashes can still be computed without ever building anything, but will require calculating the Theory hashes of every dependency and their dependencies, down to the base theories with no dependencies and hashes based only on their Script hash. This is probably where a lockfile might come in to prevent the redundant recalculation of Theory hashes.
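A minimal sketch of that recursion (Python; the dependency map, hash choice, and memoisation-as-lockfile are all assumptions for illustration):

```python
import hashlib

def script_hash(script_path):
    """Plain hash of a Script file's bytes; no dependencies considered."""
    with open(script_path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def theory_hash(name, deps, memo=None):
    """deps: name -> (script_path, [parent theory names]) -- a
    hypothetical stand-in for Holmake's dependency information.
    The memo table plays the lockfile's role: each Theory hash is
    computed once, even for widely shared ancestors."""
    if memo is None:
        memo = {}
    if name not in memo:
        script_path, parents = deps[name]
        h = hashlib.sha1()
        h.update(script_hash(script_path).encode())
        for parent in sorted(parents):            # canonical ordering
            h.update(theory_hash(parent, deps, memo).encode())
        memo[name] = h.hexdigest()
    return memo[name]
```

Base theories with no dependencies fall out of the same definition: their Theory hash is just the hash of their Script hash.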
A cache would mostly just be a trusted online theory store with the flat structure and indexed by Theory hash. Whenever Holmake tries to build/retrieve a theory, it calculates its Theory hash and queries the cache for that hash. If it hits, it has the Theory. Otherwise, it will try to build/retrieve its dependencies in the same manner then build using the Script file. It will subsequently place the built Theory in the store by its hash.
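In pseudocode, the retrieve-or-build step might look like this, reusing theory_hash from the sketch above (store, cache, and build are invented interfaces):

```python
def get_theory(name, deps, store, cache, build):
    """store: local hol-store keyed by Theory hash; cache: remote
    mapping from Theory hash to a built theory; build(name) runs the
    real build and returns its output. All interfaces are hypothetical."""
    h = theory_hash(name, deps)       # computable without building anything
    if h not in store:
        if h in cache:                # hit: just retrieve
            store[h] = cache[h]
        else:                         # miss: ensure parents, then build
            for parent in deps[name][1]:
                get_theory(parent, deps, store, cache, build)
            store[h] = build(name)
    return store[h]
```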
> Yeah, agreed. We shouldn't do what I suggested earlier and have script-file hashes in `.dat` files.
>
> Given that the `.dat` files then contain all of the relevant hashes of their ancestors, and those ancestors do likewise, it seems like this particular sub-task is already done.
It's not done if we actually need source hashes. The .dat files currently perfectly achieve recursive hashing for theory build outputs, but not for theory sources. It's still unclear to me how we ought to find the key for cache lookup before building.
@pylasnier I don't think I was fully able to follow your description - maybe we can discuss it on Zulip?
I think we aim first for something minimal: caching the work to produce theories by running script files.
What does "producing a theory" mean? I think what Holmake actually produces when it produces a theory from running a script file fooScript.sml is all these files:
- fooTheory.dat
- fooTheory.sml
- fooTheory.sig
- fooTheory.ui
- fooTheory.uo
I don't know what the .ui and .uo files are for or whether/how to cache them :/ The others are build outputs that I believe should simply be cached verbatim.
What are the inputs to the build process for a script file (i.e. what should be hashed to produce the key into the cache)? I believe it's the script file itself (fooScript.sml) plus all the dependencies (.dat, .sml, .sig, possibly others?) that Holmake itself figures out when building theories currently. So I think we should use Holmake as-is to produce a list of dependencies for a theory, canonicalise their order, and hash their contents to produce the cache key for the above outputs.
Again I don't know how .ui and .uo will affect this, since maybe they contain path information that should not be part of what is hashed (will differ per machine).
The easiest version of this might be if we can just ignore .ui and .uo files and let Holmake do whatever it is already doing with them, but ignore them for caching purposes.
If that is sound, then here's a proposal of what to implement for this issue:
`Holmake --cachekey fooTheory`: Calculate dependencies as would be used to build fooTheory, but then instead of actually building it, output the hash of the concatenation of the dependencies in a canonical ordering.
Important to remember that the dependencies for fooTheory will be build outputs, not sources (except fooScript.sml itself). So we could potentially lean on the fact that .dat files are already recursively hashed there. But I think it's also fine to just use "all the dependencies" as calculated by Holmake and this is more robust to unusual dependencies coming from Holmakefile rules.
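Concretely, the `--cachekey` computation might amount to something like this (a sketch only; `deps_of` is a hypothetical stand-in for Holmake's dependency calculation):

```python
import hashlib

def cache_key(theory, deps_of):
    """deps_of(theory) -> list of dependency file paths, including
    fooScript.sml itself, as Holmake would compute them (hypothetical)."""
    h = hashlib.sha1()
    for path in sorted(deps_of(theory)):          # canonical ordering
        with open(path, "rb") as f:
            h.update(f.read())                    # hash concatenated contents
    return h.hexdigest()
```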
I think what you've described it terms of how the hash for a theory should be produced is good, and how I would imagine a nix-stye system would do it. It probably wouldn't be too hard to add that feature to Holmake as you've described it. And importantly, hashes should be calculable without having to build anything, including dependencies, so that you should be able to potentially get that theory from a cache.
This seems plausible but there are some wrinkles yet, I think. In particular, while a theory file childTheory.dat does indeed depend on its script file (childScript.sml) and related ancestor ancestorTheory.dat files, those ancestorTheory.dat files are themselves build products, and don't exist at $t=0$. So,
> And importantly, hashes should be calculable without having to build anything, including dependencies
is just not true.
This may seem weird, but is analogous to the way object file parse.o depends on parse.c, but that is actually generated by a call to yacc, so really depends on parse.grm. What is the Nix approach here: to hash only the leaves of the dependency tree, which are all real source files, and not generated?
I think the approach I described works, fwiw. Build outputs are required for most cache keys in my approach, but you only need to look up those keys when you already have those build outputs. Theoretically the very first cache keys are the only ones that depend only on sources (but in reality users will always have already downloaded some very early build outputs like boolTheory.dat when they download/build Holmake in the first place).
I'm imagining Holmake will alternate these steps: calculate key, lookup cache, download, calculate key, lookup cache, download, calculate key, lookup cache (miss), run a real build, calculate key, etc
I also think that it's fine for it to do the calculate key, lookup cache ... steps when an ancestor is changed without its .dat changing. It will cache hit every time.
The main thing is that different keys can point to the same build output.
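A sketch of that alternation (every interface here is a hypothetical stand-in; `key_of` would be the `--cachekey` computation above):

```python
def ensure(theory, parents_of, key_of, cache, fetch, build):
    """Alternating loop sketch. parents_of/key_of query Holmake, fetch
    installs cached build outputs on disk, build runs a real build and
    returns the outputs to be cached. All of these are invented."""
    for parent in parents_of(theory):
        ensure(parent, parents_of, key_of, cache, fetch, build)
    key = key_of(theory)      # only computable once parents' outputs exist
    if key in cache:
        fetch(cache[key])     # cache hit: download outputs, skip the build
    else:
        cache[key] = build(theory)   # miss: real build, then populate cache
```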
Yes, you're right I think. Do you want to check some build hashes to make sure our current dat-hashing is doing the right thing? On 5e90404ff31 (develop), what is your shasum for
```
$ shasum src/list/src/.holobjs/listTheory.dat
```
I get 3d36b05766bb2876baa8ab6141f1dfc2571bd217.
A minor thing, but on Windows, I wonder if anyone will be storing their source files with Windows-style line endings? I think we can certainly set up the repo to prevent people getting them, but if we generate sources in tar-balls or zip-files we may start to see people getting messed up.
I can confirm:

```
3d36b05766bb2876baa8ab6141f1dfc2571bd217  src/list/src/.holobjs/listTheory.dat
```
We won't be accepting build outputs into the cache permissionlessly (not sure if that's relevant to your question about Windows style line endings, but just to be clear). Possibly Holmake will need to check (and modify?) the script file line-endings style before it hashes it.
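If Holmake does normalise line endings before hashing, a minimal version could look like this (an assumption about policy, not current behaviour):

```python
import hashlib

def normalised_hash(path):
    """Hash with CRLF (and stray CR) mapped to LF, so the same script
    hashes identically on Windows and Unix checkouts (assumed policy)."""
    with open(path, "rb") as f:
        data = f.read()
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.sha1(data).hexdigest()
```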
> This may seem weird, but is analogous to the way object file `parse.o` depends on `parse.c`, but that is actually generated by a call to `yacc`, so really depends on `parse.grm`. What is the Nix approach here: to hash only the leaves of the dependency tree, which are all real source files, and not generated?
The nix approach is that everything is hashed recursively, including tools like yacc, which are considered build-time dependencies. So long as building theories is consistently reproducible and deterministic, pinning all source files and build-time dependencies (of which the most important, I suppose, is the HOL kernel) makes the hash reproducible without building.
> I'm imagining Holmake will alternate these steps: calculate key, lookup cache, download, calculate key, lookup cache, download, calculate key, lookup cache (miss), run a real build, calculate key, etc
I imagine this would work, but I think from a UX perspective it's not completely ideal. This requires that the user, if they want the late stage theories from CakeML for example, must basically go through the motions of building every theory. If the hashing were entirely source-based, the user should instead just be able to pick a theory from a source-based HOL package repository with a pinned CakeML, for which the Theory hashes are already known/easily calculable, then just the target theory may be grabbed from the cache by its hash.
That's at least how something akin to nix would work. It might be sensible that we don't go the whole way with the nix approach and use some hashing from build artifacts like you have suggested, but I would be cautious about how this might introduce some issues with tangled source-based and build-based dependency graphs in the future.
The nix approach, which I don't fully understand yet but I think means "only hash sources; hash all source dependencies to compute a cache key for a target", sounds like a bigger departure from the status quo (I don't think Holmake currently computes all source dependencies in one place?). I'm not sure about doing that now versus a more incremental change like what I suggested...
Also I guess the specific UX problem you mentioned could be solved by additionally publishing links to download whole packages at specific "package versions" -- it would automate the "hash, lookup, download, repeat" cycle for several iterations when a very late target is known, so you can download all the outputs at once. (Assuming that just doing the cycle the naive way isn't already enough of a speed up compared to the status quo.)
But either way (cache keys from: all sources like nix, or immediate deps like I proposed) would be fine by me. Now we need to make a decision - anyone else want to weigh in?
I don't think following nix is the way to go. Nix has to be too conservative due to it working as a general build system. For example in nix a change of comments in rich_list will require the whole cakeml to be rebuilt.
> I don't think following nix is the way to go. Nix has to be too conservative due to it working as a general build system. For example in nix a change of comments in rich_list will require the whole cakeml to be rebuilt.
I personally don't think that's an issue of whether we follow nix, some sort of rebuild could still happen because of changed comments. We could strip comments and/or whitespace before hashing. That could still fit into a nix-style system.
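For example, stripping SML's nestable (* ... *) comments before hashing could look like the sketch below (it deliberately ignores the corner case of "(*" appearing inside string literals):

```python
def strip_sml_comments(src: str) -> str:
    """Drop nested (* ... *) comments; whitespace could be collapsed
    similarly before hashing. String literals are not handled."""
    out, depth, i = [], 0, 0
    while i < len(src):
        if src.startswith("(*", i):
            depth += 1
            i += 2
        elif depth and src.startswith("*)", i):
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(src[i])
            i += 1
    return "".join(out)
```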
> But either way (cache keys from: all sources like nix, or immediate deps like I proposed) would be fine by me. Now we need to make a decision - anyone else want to weigh in?
I think your approach is probably a sensible starting point regardless, so I'm happy with immediate dependency hashing.
I'm personally interested now in the potential for nix as a build system, but I will look into that separately.
Anything less than comparing based on build output is being conservative. Comments are just one kind of change that a generic hash-all-sources scheme will not be able to ignore. I personally want refactoring a proof to not cause a rebuild of child theories.
I want to remind you that HOL theories are essentially compiled libraries of SML, and the entire HOL system can be seen as a library of SML. The fact that under PolyML these theories do not have binary files (with machine code) does not mean the same holds in other SML implementations - Moscow ML indeed produces binaries (as loadable dynamic objects). In the future maybe HOL4 will support building under SML/NJ again, and then theory/library loading with calls of prove will no longer be a performance problem. If you people modify the build structure too much, making it too PolyML-specific, it will be harder and harder to support other SML platforms.
> > I'm imagining Holmake will alternate these steps: calculate key, lookup cache, download, calculate key, lookup cache, download, calculate key, lookup cache (miss), run a real build, calculate key, etc
>
> I imagine this would work, but I think from a UX perspective it's not completely ideal. This requires that the user, if they want the late stage theories from CakeML for example, must basically go through the motions of building every theory. If the hashing were entirely source-based, the user should instead just be able to pick a theory from a source-based HOL package repository with a pinned CakeML, for which the Theory hashes are already known/easily calculable, then just the target theory may be grabbed from the cache by its hash.
You can't run with just the later theory files; you need all of them in order to be able to load them and do your theorem-proving. So, the path from lateCakeMLTheory.uo back to the root theory needs to be populated with all build products. (But that's easy to have Holmake do, as per the process above.)