cargo icon indicating copy to clipboard operation
cargo copied to clipboard

Implement fine grain locking for `build-dir`

Open ranger-ross opened this issue 2 months ago • 10 comments

This PR adds fine grain locking for the build cache using build unit level locking. I'd recommend reading the design details in this description and then reviewing commit by commit. Part of #4282

Previous attempt: #16089

Design decisions / rational

  • Using build unit level locking instead of a temporary working directory.
    • After experimenting with multiple approaches, I am currently leaning to towards build unit level locking.
    • The working directory approach introduces a fair bit of uplifting complexity and I further along I pushed my prototype the more I ran into unexpected issues.
      • mtime changes in fingerprints due to uplifting/downlifting order
      • tests/benches need to be ran before being uplifted OR uplifted and locked during execution which leads to more locking design needed. (also running pre-uplift introduces other potential side effects like the path displayed to the user being deleted as its temporary)
    • The trade off here is that with build unit level locks, we need a more advanced locking mechanism and we will have more open locks at once.
    • The reason I think this is a worth while trade of is that the locking complexity can largely be contained to single module where the uplifting complexity would be spread through out the cargo codebase anywhere we do uplifting. The increased locks count while unavoidable can be mitigated (see below for more details)
  • Risk of too many locks (file descriptors)
    • On Linux 1024 is a fairly common default soft limit. Windows is even lower at 256.
    • Having 2 locks per build unit makes is possible to hit with a moderate amount of dependencies
    • There are a few mitigations I could think of for this problem (that are included in this PR)
      • Increasing the file descriptor limits of based on the number of build units (if hard limit is high enough)
      • Share file descriptors for shared locks across jobs (within a single process) using a virtual lock
        • This could be implemented using reference counting.
      • Falling back to coarse grain locking if some heuristic is not met

Implementation details

  • We have a stateful lock per build unit made up of multiple file locks primary.lock and secondary.lock (see locking.rs module docs for more details on the states)
    • This is needed to enable pipelined builds
  • We fall back to coarse grain locking if fine grain locking is determined to be unsafe (see determine_locking_mode())
  • Fine grain locking continues to take the existing .cargo-lock lock as RO shared to continue working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
  • Locking is disabled on network filesystems. (keeping existing behavior from https://github.com/rust-lang/cargo/pull/2623)
  • cargo clean continues to use coarse grain locking for simplicity.
  • File descriptors
    • I added functionality to increase the file descriptors if cargo detects that there will not be enough based on the number of build units in the UnitGraph.
    • If we aren’t able to increase a threshold (currently number of build units * 10) we automatically fallback to coarse grain locking and display a warning to the user.
      • I picked 10 times the number of build units a conservative estimate for now. I think lowering this number may be reasonable.
    • While testing, I was seeing a peak of ~3,200 open file descriptors while compiling Zed. This is approximately x2 the number of build units.
      • Without the RcFileLock I was seeing peaks of ~12,000 open fds which I felt was quiet high even for a large project like Zed.
  • We use a global FileLockInterner that holds on to the file descriptors (RcFileLock) until the end of the process. (We could potentially add it to JobState if preferred, it would just be a bit more plumbing)

Open Questions

  • [x] How do we want to handle locking on the artifact directory?
    • We could simply continue using coarse grain locking, locking and unlocking when files are uplifted.
    • One downside of locking/unlocking multiple times per invocation is that artifact-dir is touch many times across the compilation process (for example, there is a pre-rustc clean up step Also we need to take into account other commands like cargo doc
    • Another option would to only take a lock on the artifact-dir for commands that we know will uplift files. (e.g. cargo check would not take a lock artifact-dir but cargo build would). This would mean that 2 cargo build invocations would not run in parallel because one of them would hold the lock artifact-dir (blocking the other). This might actually be ideal to avoid 2 instances fighting over the CPU while recompiling the same crates.
    • Solved by #16230
  • [ ] What should our testing strategy for locking be?
    • My testing strategy thus far has been to run cargo on dummy projects to verify the locking.
    • For the max file descriptor testing, I have been using the Zed codebase as a testbed as it has over 1,500 build units which is more than the default ulimit on my linux system. (I am happy to test this on other large codebase that we think would be good to verify against)
    • It’s not immediately obvious to me as to how to create repeatable unit tests for this or what those tests should be testing for.
    • For performance testing, I have been using hyperfine to benchmark builds with and without -Zbuild-dir-new-layout. With the current implementation I am not seeing any perf regression on linux but I have yet to test on windows/macos.
  • [ ] Should we expose an option in the future to allow users to force coarse grain locking?
    • In the event a user’s system can’t support fine grain locking for some reason, should we provide some way to control the locking mode like CARGO_BUILD_LOCKING_MODE=coarse?
    • If so, would this be something that would be temporary during the transition period or a permanent feature?
  • [ ] What should the heuristics to disable fine grain locking be?
    • Currently, it's if the max file descriptors are less than 10 times the number of build units. This is pretty conservative. From my testing, it generally peaks around 2 times the number of build units.
    • I wonder if there is any other information in the crate graph that we could as a heuristic?
  • [ ] How should we interface with file system lock? See https://github.com/rust-lang/cargo/pull/16155#discussion_r2466941516

ranger-ross avatar Oct 26 '25 13:10 ranger-ross

r? @ehuss

rustbot has assigned @ehuss. They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot avatar Oct 26 '25 13:10 rustbot

On Linux 1024 is a fairly common default soft limit

Afaik the main reason this default exists is that programs using the select syscall don't run into its limitations. If the hard limit is higher and a program doesn't use select it's ok to raise it.

the8472 avatar Nov 01 '25 12:11 the8472

Afaik the main reason this default exists is that programs using the select syscall don't run into its limitations. If the hard limit is higher and a program doesn't use select it's ok to raise it.

Correct! in this commit I added handling to attempt to increase the soft limit if the hard limit is high enough. Otherwise, we fallback to coarse locking.

ranger-ross avatar Nov 01 '25 14:11 ranger-ross

:umbrella: The latest upstream changes (possibly 28163c03f4ecca3afb14175b082164fea1f1289e) made this pull request unmergeable. Please resolve the merge conflicts.

rustbot avatar Nov 05 '25 21:11 rustbot

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

rustbot avatar Nov 22 '25 13:11 rustbot

Finally coming back to this PR now that we have merged #16230. I rebased to resolve the conflicts and responded to some of the open threads.

ranger-ross avatar Nov 22 '25 15:11 ranger-ross

:umbrella: The latest upstream changes (possibly 8e43074b2365e1f908570d7f5c2ef76b19d1133c) made this pull request unmergeable. Please resolve the merge conflicts.

rustbot avatar Nov 23 '25 03:11 rustbot

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

rustbot avatar Nov 24 '25 08:11 rustbot

In general, something I realized we need to watch out for is the exact circumstances of why we don't always use -Cextra-filename, whether it is about file names or the entire path. This means moving of built content could cause problems

https://github.com/rust-lang/cargo/blob/8e43074b2365e1f908570d7f5c2ef76b19d1133c/src/cargo/core/compiler/build_runner/compilation_files.rs#L873-L882

epage avatar Nov 24 '25 21:11 epage

:umbrella: The latest upstream changes (possibly dce5a6b8f5c70f8216a73b13cecd4211e81ac79b) made this pull request unmergeable. Please resolve the merge conflicts.

rustbot avatar Nov 25 '25 17:11 rustbot

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

rustbot avatar Dec 03 '25 14:12 rustbot

:umbrella: The latest upstream changes (possibly e0dd406b88933e82136bbc9073e06d029db8da45) made this pull request unmergeable. Please resolve the merge conflicts.

rustbot avatar Dec 04 '25 05:12 rustbot

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

rustbot avatar Dec 04 '25 13:12 rustbot

@epage I re-reviewed the changes in this PR and I believe they accomplish the the first step in the plan laid out in #t-cargo > Build cache and locking design @ 💬.

So I think this PR is now good to be reviewed.

ranger-ross avatar Dec 04 '25 13:12 ranger-ross

:umbrella: The latest upstream changes (possibly 1d3fb78257c97258d6ae6368859e36495fb33c8e) made this pull request unmergeable. Please resolve the merge conflicts.

rustbot avatar Dec 04 '25 15:12 rustbot