scikit-build-core
scikit-build-core copied to clipboard
feat: Cache build environments by content hash
Rationale of this PR
Reusing build directories already speeds up repeated pip install invocations significantly because most build artifacts can be reused. However, when using CMake packages provided by pip, as commonly used for pybind11 and friends, CMake will always consider them out of date when building with PEP 517 build isolation enabled (because the temporary build folder changes on every invocation).
Implementation
Write an archive consisting of the build environment provided by pip, and then extract it at a path containing the SHA256 hash of its contents. The archive contents are filtered: __pycache__ files are ignored (see below), and mtime is set to 0 so that it is essentially ignored - CMake should only look at the file path (containing the hash), and completely ignore mtime.
Because the path contains the hash of its contents, this is perfectly safe since a changed build environment will change the hash.
Open Questions
- [ ] Should this be an experimental feature?
- [ ] How should enabling the feature work? Using an enable setting, or perhaps by setting a cache path?
- [ ] Why are hashes of
__pycache__files changing on every invocation? Is it about the file metadata, or does pip somehow change the files themselves? - [ ] What about very large build environments? The archival and extraction is currently run on every invocation. With larger archive sizes, performance would likely be quite degraded.
- [ ] Should the archive be split per package? This would allow reusing existing cache directories more often, while complicating the code somewhat.
I hacked this PR together when I got the idea to do hash-based caching earlier today. I'm looking forward to any comments, especially whether this is a good fit for scikit-build-core and whether is a good idea.
I'm not an expert on pip internals so perhaps there are some details about the temporary build environment that I missed. Perhaps there's a simpler way to solve all this.
TODO
- [ ] Replace use of
hashlib.file_digest, it is only available in Python 3.11+
I haven't added any tests or documentation yet because I didn't want to waste any work in case this approach is not a good fit for scikit-build-core. @henryiii (or other maintainers) If you don't mind, I'd love to hear some feedback as to whether this idea is viable and whether this approach is the way forward. Thanks!
I haven't had time to look into it yet - the question I would answer first, does the current mechanism not work? We record the paths, then if that recording exists from a previous run, ~~we swap it out for the current path in the cache~~. IIRC that was to address this issue. https://github.com/scikit-build/scikit-build-core/blob/5aa2d28b68b93d42f4260fabecd7c213e45ce8b0/src/scikit_build_core/cmake.py#L89-L106
Ah, no, I don't think we do, I think for now we just remove everything. That was the original plan though - would that work? We do a search and replace on the cache file to put the new paths in.
The benefit of storing the actual environment is you could then work without pip restoring it (say in editable mode), but it's pretty unusual and unexpected to grab a copy of the build environment. I think modifying the paths (and/or times) based on a stored value (we could also store hashes) and relying on the provided build environments would be better?
I haven't had time to look into it yet - the question I would answer first, does the current mechanism not work? We record the paths, then if that recording exists from a previous run, ~we swap it out for the current path in the cache~. IIRC that was to address this issue.
I'm not entirely sure what role the mechanism you linked plays. I just checked a .skbuild-info.json file in one of my projects, and its source_dir is simply equal to the real source_dir, so no caching is taking place. But the contents of the source directory should not have any impact on the build environment path issue I'm trying to fix here.
To clarify, what I mean is that CMake might reference paths from the build environment in the actual compile commands.
For example, when including pybind11 using build-system.requires, a typical compile command line might look like this: /usr/bin/c++ ... -isystem /tmp/pip-build-env-0ew390zl/overlay/lib/python3.11/site-packages/pybind11/include ...
Obviously, the /tmp/pip-build-env-0ew390zl part of the include directory will change on every run of pip install, and even if its contents stay the same, Ninja will still consider all files which contain that include flag as out of date.
To give some context (perhaps others use scikit-build-core differently): For the projects I'm using scikit-build-core on, setting tool.scikit-build.build-dir eliminates 90% of the build time since all object files that don't reference pybind11 can be reused. But the files that use pybind11 are still rebuild every time. With this PR, running pip install . on an unchanged source tree results in near-instant builds (it's mostly just pip doing dependency resolution).
The benefit of storing the actual environment is you could then work without pip restoring it (say in editable mode), but it's pretty unusual and unexpected to grab a copy of the build environment. I think modifying the paths (and/or times) based on a stored value (we could also store hashes) and relying on the provided build environments would be better?
I agree that it is not the prettiest solution. I've thought about somehow asking pip to provide a persistent path for the build requirements. But I guess that sort of goes against the spirit of build isolation. Modifying times would not be enough since the rebuilds are triggered by the path itself changing. Modifying paths based on a stored hash is more or less what this PR does, or did you have something else in mind?
BTW: If you're wondering about why the PR first creates, hashes and then extracts an archive instead of just hashing files and copying them directly: The idea is that creating the archive provides a defined wire format on which we can calculate a hash. If we just copied files directly while calculating their hash value, we'd have to combine the hash values in a defined format in order to come up with a hash for directories and finally the entire build environment - which is essentially reinventing what an archive file is. Also, by extracting the temporary archive, we can be sure that we only provide files which we also included in the hash, reducing the chance for bugs due to forgotten files or changed metadata.
(FYI, this is still on my roadmap to investigate!)
To clarify the issue from the cmake side:
- this issue still occurs even when the build directory is specified statically in the configuration? Does Pep517 overwrite the build folder when built in isolation?
- this is about cmake build artifacts and not the source files that might trigger a re-configure?
- this workflow reusing artifacts is specifically for development or also for some CI/CD?