(Option to) Fingerprint by file contents instead of mtime
Describe the problem you are trying to solve
The primary problem I have is that when building my code on Travis, the actual code in my workspace builds every time, even though much of it hasn't changed and I have target-directory caching on. The reason is that Travis makes a new clone of my git repo, which doesn't preserve mtimes. This can add about 5 minutes to every Travis run. My project is mixed Rust and non-Rust code, so this adds 5 minutes to those runs even if no Rust code has been affected. I started futzing with mtimes, but that seems fragile and doesn't solve the root of the problem.
Additionally, edit-undo loops cause re-compilation locally, which is a little annoying.
Describe the solution you'd like
Add a new LocalFingerprint::ContentBased(Digest, PathBuf) variant to https://github.com/rust-lang/cargo/blob/b84e625c4401be3893a4d6d44d5dbac82b33f1c4/src/cargo/core/compiler/fingerprint.rs#L204-L209 which reads the content of the PathBuf, passes it through a SipHasher, and mixes that into any aggregate fingerprints. Use this instead of LocalFingerprint::MtimeBased.
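For illustration, here is a minimal standalone sketch of the idea (made-up names, not cargo's actual fingerprint types): a content-based fingerprint is just the digest of a file's bytes plus the path it was computed from, and staleness is decided by re-digesting rather than comparing mtimes. std's DefaultHasher (a SipHash variant) stands in for whichever hasher cargo would actually use.

```rust
// Standalone sketch only: `ContentBased` is a made-up stand-in for the proposed
// variant, and `DefaultHasher` (SipHash-1-3) stands in for cargo's hasher.
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// A content-based fingerprint: the digest of a file's bytes plus its path.
struct ContentBased {
    digest: u64,
    path: PathBuf,
}

impl ContentBased {
    /// Read the file and digest its contents instead of recording its mtime.
    fn new(path: &Path) -> io::Result<Self> {
        let bytes = fs::read(path)?;
        let mut hasher = DefaultHasher::new();
        bytes.hash(&mut hasher);
        Ok(ContentBased { digest: hasher.finish(), path: path.to_path_buf() })
    }

    /// The file is dirty only if its current contents hash differently.
    fn is_dirty(&self) -> io::Result<bool> {
        Ok(ContentBased::new(&self.path)?.digest != self.digest)
    }
}

fn main() -> io::Result<()> {
    let fp = ContentBased::new(Path::new("src/main.rs"))?;
    println!("digest of {}: {:016x}", fp.path.display(), fp.digest);
    println!("dirty: {}", fp.is_dirty()?);
    Ok(())
}
```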
Notes
This will probably slow down no-op builds slightly (in some circumstances, such as with large build script inputs over NFS, significantly), so may want to be behind a flag (perhaps --fingerprint-strategy={mtime,content}).
This would probably also make more shared caching (that people talk about a lot, most recently at https://internals.rust-lang.org/t/idea-cargo-global-binary-cache/9002) easier.
I'd be happy to implement this if the PR is likely to be accepted :)
This would probably also fix
- #4425
- #3076
- #7775
- #5918
- #8868 ?
- #10175 ?
- #12060
- #13119
Funnily enough, this is also highly related to recent discussions on https://github.com/rust-lang/cargo/issues/2426 and https://github.com/rust-lang/cargo/pull/6484 for entirely different reasons (filesystems with second-level mtime granularity, where modifications and builds all happen within the same second).
cc @Eh2406, @matklad, this might be a good tracking issue for this!
At the moment I would start by removing the mtime-based system and replacing it with a hash-based one (except for the corner cases where we have to use mtime, as we don't know which files to hash until after the build; cc #5918). Then I would see how big a perf impact this has in reality. If it is small then grand; if not, then we need a hybrid system that uses mtime to decide whether to hash.
I was thinking of trying to impl this soon, but would much prefer to help you do it @illicitonion. I seem to recall someone had an abandoned branch with a start... was it @ehuss?
Do we want to use SipHasher or a different hashing scheme that is designed for fingerprinting? I only ask because of this:
"Although the SipHash algorithm is considered to be generally strong, it is not intended for cryptographic purposes. As such, all cryptographic uses of this implementation are strongly discouraged."
We're not really too interested in cryptographic hashing here for its security properties, so I think picking any reasonable fast algorithm should be fine (and I think SipHasher is reasonably fast as well). I would personally prefer we not wholesale switch away from mtimes just yet, but having this as at least a runtime option to choose from is a great start!
Would be interesting to compare it to using FxHash.
I'm not sure what I did with my old branch, so I'm just going off memory. My original attempt tried to only hash the contents if the mtime was near the start of compile time. However, since it is not known which files to hash until after compilation, it didn't work out too well. I don't know how to solve that problem. I guess one option is to pre-hash every file in the package directory and maybe hash anything that was included from outside after compilation? But that sounds like it could be very problematic. Or maybe hash protection is just not available for the very first compile (or in other words the second compile is not guaranteed to be correct). Maybe another option is to run --emit=dep-info first, which seems to be pretty fast, but there might be some races still.
I don't know of any easy answers here.
I just came across a long blog post on the troubles with using mtime in build tools. It discusses using only hashes, and points out that Go switched to that, but recommends a hybrid approach, which is surprisingly similar to what we are doing with the fingerprint hash as a database.
That blog post [1] is an excellent reference! It lays out some of the problems with using mtimes and hashes in build systems and describes the approach used by the tool redo, roughly:
- Storing cheap metadata (mtime, size, inode number, owner uid and gid, a sequence number (for outputs only)) for each input and output file.
- Rebuild whenever any of that metadata changes.
I found it helpful to consider the sequence number for an output as a logical time that we control, so it should avoid problems with low resolution, mtime going backward, builds being missed.
This wouldn't stop the false-positive builds we see on CI systems that don't preserve mtimes, but to avoid these we could check hashes of source files when their metadata changes. One of the problems claimed in [1] is that hashing large build output files (to confirm downstream should rebuild) is expensive, but we could benchmark this to examine the trade offs. Maybe we could get away with hashing source files only?
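As a rough sketch of the "cheap metadata stamp" idea described above (assumed names, Unix-only for brevity, not redo's or cargo's actual code): record mtime, size, inode and owner per file, and rebuild whenever any of that tuple differs.

```rust
// Sketch of the "cheap metadata stamp" idea described above (assumed names,
// Unix-only for brevity, not redo's or cargo's actual code).
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;
use std::time::SystemTime;

/// The cheap per-file metadata redo records for inputs (outputs would also get
/// a sequence number acting as a logical clock).
#[derive(Debug, PartialEq, Eq)]
struct Stamp {
    mtime: SystemTime,
    size: u64,
    inode: u64,
    uid: u32,
    gid: u32,
}

fn stamp(path: &Path) -> io::Result<Stamp> {
    let meta = fs::metadata(path)?;
    Ok(Stamp {
        mtime: meta.modified()?,
        size: meta.len(),
        inode: meta.ino(),
        uid: meta.uid(),
        gid: meta.gid(),
    })
}

/// Rebuild whenever any of the recorded metadata differs from what's on disk.
fn needs_rebuild(path: &Path, recorded: &Stamp) -> io::Result<bool> {
    Ok(stamp(path)? != *recorded)
}
```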
Please also have a look at chash. This is a mechanism for a meaningful fingerprint (based on the AST of the program). Maybe this can be adapted for cargo as well.
I was just hit by this bug. Docker doesn't refresh timestamps of already-existing files, and I had hacked around the need to cache dependency builds by doing echo 'fn main() {}' > src/main.rs. Pretty painful to debug, and the touch I added to work around it looks clunky.
Steps to reproduce
- Create the following files:
Dockerfile
FROM rust
COPY project/Cargo.toml project/Cargo.toml
RUN cd project && mkdir src && echo 'fn main() { println!("hello1"); }' > src/main.rs && cargo build --release
COPY project/src project/src
RUN cd project && cargo run --release
project/Cargo.toml
[package]
name = "hello"
version = "0.1.0"
authors = ["Jacek Wielemborek"]
edition = "2018"
[dependencies]
flate2 = "*"
project/src/main.rs
fn main() {
println!("Hello2");
}
- Run
docker build .
Expected:
Hello2 is printed
Actual result:
hello1 is printed
Workaround
Before building, do touch project/src/main.rs
I think this also has a chance of speeding up some dev workflows around switching branches. Specifically, I often check out branches temporarily to look at PRs (without building them). Sometimes I also run git switch master && git pull --rebase && git switch my-feature-branch && git rebase master. I think both of these workflows lead to busted mtimes for a bunch of files which haven't actually changed between compilations. For sprawling workspaces (like the rust-analyzer one), I hypothesize that this leads to some unneeded crate rebuilds.
Has anyone got a PR or partial WIP PR on this yet? Having the option to do this would be huge for Docker caching, which becomes more important for enterprises as Rust builds get bigger. Would love to be able to test this out in nightly under a -Z flag.
If there was progress it would have been linked here. As I recall, I thought it could be done straightforwardly, but the fact that Eric's branch did not work suggests I was Dunning-Krugering myself. My memory of the approach was to change the code that reads the mtime to return a new type (mtime, Option<hash>). If the -Z flag is set, fill in the hash. Thread that type all the way through. When we get to the compare step, check the -Z flag for which to use. If you are offering to try, that is where I would start.
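To make that shape concrete, here is a tiny sketch (hypothetical names and types, not cargo's internals) of threading an (mtime, Option<hash>) pair through and choosing the comparison based on the flag:

```rust
// Hypothetical types, not cargo's internals: thread an (mtime, Option<hash>)
// pair through the fingerprint code and pick the comparison based on the flag.
use std::time::SystemTime;

struct FileStamp {
    mtime: SystemTime,
    /// Only filled in when the -Z flag mentioned above is enabled.
    hash: Option<u64>,
}

/// `recorded` is the stamp stored in the fingerprint from the last build,
/// `current` is the stamp of the file on disk right now.
fn is_fresh(recorded: &FileStamp, current: &FileStamp, use_hash: bool) -> bool {
    match (use_hash, recorded.hash, current.hash) {
        // Flag set and both hashes available: compare contents.
        (true, Some(prev), Some(now)) => prev == now,
        // Otherwise fall back to today's mtime comparison.
        _ => current.mtime <= recorded.mtime,
    }
}
```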
Was just checking there's no PR out there to build on. If the mtime's not the same, we could cheaply check whether the file size is the same before doing an expensive hash-of-contents check - that could cut the perf costs down a bit.
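That pre-check could look something like this (illustrative sketch only): compare the stored size first, and only read and hash the contents when the sizes still match.

```rust
// Illustrative only: check the recorded size before paying for a full read+hash.
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;

fn content_changed(path: &Path, recorded_size: u64, recorded_hash: u64) -> io::Result<bool> {
    // Different size: definitely changed, no need to read the file at all.
    if fs::metadata(path)?.len() != recorded_size {
        return Ok(true);
    }
    // Same size: fall back to the expensive content hash.
    let mut hasher = DefaultHasher::new();
    fs::read(path)?.hash(&mut hasher);
    Ok(hasher.finish() != recorded_hash)
}
```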
I believe this problem was already extensively studied in the past, so there's no need to reinvent the solution and just pick a good existing one. For example, git's index is known to store mtime+ctime+device+inode+hash.
ctime, device and inode will thwart attempts to do clever things like copying the target dir and putting it back on a different machine or in a different directory. When the mtime check fails, it's probably best to rely on intrinsic properties of the content (size/hash). For a non-cryptographic hash, redox's seahash seems pretty quick.
mtime comparison considered harmful, linked above, is an excellent blog post! I ran into one of the problems described there today:
If you accidentally set your system clock ahead by a day and build some stuff, then set your clock back to the present, all the stuff you built during that time will show up as "in the future" and therefore newer than source files you edit today, preventing all rebuilds
I finally got around to taking a look at this, and have put together a draft branch. It's a relatively simple change, and not particularly invasive. The top three commits each carry timings of clean builds of cargo (which transitively contains 588 files) at that commit, averaged over 150 runs each, on a 13" MacBook Pro with the source code on an SSD. The tl;dr is that the extra time taken to digest files is in the noise. The top commit is optional, and probably not worth proposing. The second commit is the implementation of this feature.
A quick summary of the approach is: When parsing .d files, digest each file mentioned and include its digest in the generated dep-info file. When a fingerprint is invalid due to mtime, compare the current digest with that in the dep-info file and don't re-build if the digests are the same.
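In other words, something along these lines (a simplified standalone sketch, not the draft branch itself; it ignores Makefile escaping and line continuations):

```rust
// Simplified standalone sketch, not the draft branch: digest every file listed
// in a rustc-produced Makefile-style .d file (ignores escaping/continuations).
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

fn digest_dep_info(dot_d: &Path) -> io::Result<Vec<(PathBuf, u64)>> {
    let contents = fs::read_to_string(dot_d)?;
    let mut digests = Vec::new();
    for line in contents.lines() {
        // A .d line looks like "target: dep1 dep2 ..."; hash each dep mentioned.
        let Some((_, deps)) = line.split_once(':') else { continue };
        for dep in deps.split_whitespace() {
            let mut hasher = DefaultHasher::new();
            fs::read(dep)?.hash(&mut hasher);
            digests.push((PathBuf::from(dep), hasher.finish()));
        }
    }
    // These (path, digest) pairs would then be written into cargo's dep-info
    // file and compared when a fingerprint looks stale by mtime alone.
    Ok(digests)
}
```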
I don't think that branch is ready to propose as-is. There's one issue I think we should work out, a "how to roll out" question, and one corner I think we can leave unfinished but maybe folks have some ideas on how to approach:
The blocking issue
The current idea is that when cargo is parsing the .d files produced by rustc, for each file which is mentioned, cargo will read and digest that file, and store the digest in the dep-info file for the target along with the list of paths. But this causes a race condition - what if the file was changed on-disk after rustc consumed it? cargo currently avoids this race-condition by using the mtime at the start of the build, so that it may get spurious re-builds if the file was changed mid-build but won't miss changes.
I think the correct solution here is to teach rustc how to optionally emit a digest of the file it actually read. It already read the file anyway, so this adds only a trivial cost (my benchmarks showed that time spent reading the files dominated time spent hashing the files by orders of magnitude). There's a question as to how rustc should communicate this digest to cargo though. Some options that come to mind:
- This could be an extra (optional) field in the job_queue::Message::Finish message (or a new message, I guess)
- We could just teach rustc to output the cargo-format dep-info file in addition to the rustc-format dep-info file (again, perhaps optionally) so that cargo doesn't need to parse and re-emit the same data, and add the digest to this file's schema.
- We could extend the .d file syntax that rustc already produces, though this may cause some compatibility issues with other tools which use the same format.
How to roll out
The cargo dep-info files currently have a hard-to-migrate schema (they store length-prefixed raw bytes for each path, with no tags or similar). I propose the simplest way forward is to create schema version 2, have a flag to flip between them, and simply look at dep-2-lib-crate_name if the flag is set to 2 and dep-lib-crate_name if using the current behaviour. We won't bother with falling back to old schemas - you wouldn't be able to share cache hits across schema versions.
Alternatively we could start using a more sophisticated schema for this data file (e.g. some existing or novel approach which would allow for optional fields), but it's probably not worth that much thought.
The unfinished corner
Currently for build scripts, we don't digest files emitted as rerun-if-changed - similar to with rustc, there's a race condition where we can't be sure we're digesting the right file. Accordingly, build scripts have the same mtime-based rerun behaviour as they currently do. I don't think we can easily automatically solve this, but I think that fallback is also fine.
In the long-term, build scripts may evolve into something more declarative rather than being so imperative (where they could pre-declare what the build script would need to access, so we could properly track the files), or we could provide some mechanism for build scripts to tell cargo what digest should be used for a file which is rerun-if-changed (which needn't be a pure content digest - e.g. a build script could decide to strip comments out of files it reads before digesting them).
Thank you for the clear explanation of the blocking issue with this approach! I think there was some work towards a "rustc in the loop" version at https://github.com/rust-lang/cargo/pull/8623; pushing that over the finish line would be wonderful!
@illicitonion maybe we can join forces. I got a bit distracted but got quite far down this line. Have a look at https://github.com/rust-lang/rust/pull/75594 - I will have a look at yours too.
I got a bit bogged down on the cargo side dealing with the rerun-if-changed bits, but it was looking pretty promising and I'm very happy with the rustc changes.
Gosh, I'm not sure how I missed that #8623 came into existence - I'll do some catching up on it soon and get back to you, very happy to work together on things if we can!
I just want it solved, I don't mind how we do it. One thing I did notice though is that to stop rebuild cascades we do need to get the rerun-if-changed stuff right. I don't think there's much we can do about old-style rerun-if-changed, but with the new style I think we can be pretty efficient (and we can then PR-blitz major crates that still use the old style - e.g. https://github.com/rust-lang/libc/pull/2138/files ).
A particular(ly infuriating) edge case for me is when using vendored dependencies, where cargo already has, and verifies, a JSON file with cryptographic (SHA-256) hashes of every file in the crate - and yet it still rebuilds based on mtime.
Currently for build scripts, we don't digest files emitted as rerun-if-changed - similar to with rustc, [...]
I got to this issue because I generate some CSS files from SCSS in build.rs, so rerun-if-changed using fingerprinting for change detection is absolutely necessary for such a use case. (I wouldn't mind just outputting something like cargo:rerun-if-changed-checksum=some-hash:7e0f5f3b24ce5f889eb2a27ed00dae49:some-path.)
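For concreteness, a hypothetical build.rs along those lines might look like the sketch below. The file path is made up, and the rerun-if-changed-checksum directive is only the suggestion from this comment - cargo does not understand it today; the plain rerun-if-changed line is the real, existing directive.

```rust
// build.rs sketch for the SCSS use case above. The file path is made up, and the
// rerun-if-changed-checksum directive is only the suggestion from this comment;
// cargo does not understand it today. The plain rerun-if-changed line is real.
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};

fn main() {
    let path = "styles/app.scss";

    // Today: mtime-based rerun tracking.
    println!("cargo:rerun-if-changed={path}");

    // Suggested: also report a content checksum so cargo could skip the rerun
    // when only the mtime changed.
    let mut hasher = DefaultHasher::new();
    fs::read(path).expect("read scss input").hash(&mut hasher);
    println!("cargo:rerun-if-changed-checksum={:016x}:{path}", hasher.finish());
}
```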
BTW, since #8623 was just closed, I'm assuming there is no progress on this topic anymore?
I'm tired. The ideas are fine but it really needs chopping up into a few easier-to-land PRs.
@gilescope I could help doing some of that work.
Feel free to give it a go. I think the first PR is in the rustc PR - having a cut-down PR that just adds in the hash when creating the libraries. I don't think landing that would need a cargo PR.
I think I'm seeing this issue when trying to restore the cargo cache from S3 to a Docker GitHub Actions runner hosted in EC2. After restoring the cache mount to the buildkit cache context using this approach, cargo unpacks a lot of files into the $CARGO_HOME/registry/src directory. These files are newer by less than half a second relative to the cache age. Is there currently any way I can work around this? Example of the libz-sys dep getting marked old even though it isn't:
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] new local fingerprints deps "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9"
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max output mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/build/libz-sys-5ecc07734e87e07a/build_script_build-5ecc07734e87e07a" 1684755709.566429337s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max dep mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/deps/libpkg_config-aff5b77a7472bb47.rlib" 1684755709.314425018s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max dep mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/deps/libcc-afd26af49447c5d9.rlib" 1684755709.418426800s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] all paths up-to-date relative to "/platform/target/release/.fingerprint/libz-sys-5ecc07734e87e07a/dep-build-script-build-script-build" mtime=1684755708.994419534s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] filesystem up-to-date "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9"
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max output mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/build/libz-sys-79794549bcb2869b/output" 1684755709.198423031s
#21 1.604 [2023-05-22T11:43:00Z DEBUG cargo::core::compiler::fingerprint] max dep mtime for "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9" is "/platform/target/release/build/libz-sys-5ecc07734e87e07a/build_script_build-5ecc07734e87e07a" 1684755709.566429337s
#21 1.604 [2023-05-22T11:43:00Z INFO cargo::core::compiler::fingerprint] dependency on `build_script_build` is newer than we are 1684755709.566429337s > 1684755709.198423031s "/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.9"
My workaround is an external tool. It's currently in a private repo, but here's the tl;dr:
/// `retimer` saves modification times (mtimes) of source files, and restores them only if the
/// contents match a hash.
///
/// `retimer`'s intended workflow is as follows:
///
/// 1. restore `.retimer-state` and `target` from CI cache, if available
/// 2. `retimer restore`
/// 3. `cargo build`
/// 4. `retimer save`
/// 5. save `.retimer-state` and `target` in CI cache for the next run
///
/// This allows `cargo`'s mtime-based caching to work properly with CI/git.
My workaround is more aggressive but doesn't require storing a state file:
- set all mtimes in source directories to a specific date in the past
- set all mtimes for cached files in the target directory to another specific date in the past (1 day later than the first)
This does however require that one be careful about which files are included in the cache, as it can result in cargo not rebuilding things that it should.
@scottlamb I read your description of retimer here (and your detailed comments on #2644) with great interest, I am currently struggling with mtimes while restoring cargo cache from S3 into a docker cache mount. Are you planning on publishing this tool, or would you be able to share more details/code so I could try and work it into a GitHub Action I'm working on for this purpose?