clean: Optimize (legacy) clean with multiple -p specifiers
Co-authored-by: dino [email protected]
What does this PR try to resolve?
This commit optimizes implementation of cargo clean -p by reducing the amount of directory walks that take place.
We now batch calls to rm_rf_prefix_list, thus potentially avoiding multiple walks over a single subdirectory. In practice this helps us significantly reduce the runtime for clearing large workspaces (as implemented in #16263); for Zed, cargo clean --workspace went down from 73 seconds to 3 seconds.
We have 216 workspace members.
How to test and review this PR?
We've tested it by hand, running it against regex, ruff and zed codebases.
This PR is still marked as draft, as I don't love the code. I would also understand if y'all were against merging this, given that new build directory layout is in flight.
Have you tried out the performance of cargo clean --workspace with -Zbuild-dir-new-layout? Unsure what the timeline will be for stabilizing that so I can understand if you want to optimize this in the mean time.
Are we really benefiting from all of this filesystem globbing?
Should we just walk all of the content upfront, get it into a Vec, and then do a retain call on that, pattern matching against the files for the different packages, and then delete what is left?
Have you tried out the performance of cargo clean --workspace with -Zbuild-dir-new-layout?
No, I did not try -Zbuild-dir-new-layout. I'll do that and report back.
Are we really benefiting from all of this filesystem globbing?
I don't think we benefit much from globbing, particularly because most of it's use cases are about a simple prefix/suffix matching. I'm in the process of coalescing file walking further. I'll push that up to this branch before marking this PR as ready to review.
Should we just walk all of the content upfront, get it into a Vec, and then do a retain call on that, pattern matching against the files for the different packages, and then delete what is left?
If we could make the pattern matching part straightforward (to read/maintain) then sure, that'd be best. I'm clinging onto the existing setup for no good reason; the end goal for me is to have a faster cargo clean -p --workspace.
I am very happy to report that -Zbuild-dir-new-layout would make cargo clean --workspace take about a second on Zed codebase. :)
This PR is currently at 2.6s. I think we could reach a similar ballpark with the current build dir layout.. But obviously, this code will be removed at some point either way.
:umbrella: The latest upstream changes (possibly 2b48142c3fc8935044e4d6602ffaedeec0665227) made this pull request unmergeable. Please resolve the merge conflicts.
:umbrella: The latest upstream changes (possibly bfd598547cfae010523315adf50818e3dda58e7f) made this pull request unmergeable. Please resolve the merge conflicts.
r? @ehuss
rustbot has assigned @ehuss. They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.
Use r? to explicitly pick a reviewer
I've rebased the existing implementation on top of #16304 - while we could push the perf further, I think this is good enough (after all, it gets us from 70s down to 3s) for our use case.
r? epage
I don't think we benefit much from globbing, particularly because most of it's use cases are about a simple prefix/suffix matching. I'm in the process of coalescing file walking further. I'll push that up to this branch before marking this PR as ready to review.
In my mind, we would do file walking by
let mut cache: HashMap<PathBuf, Vec<OsString>> = Default::default();
let mut to_remove: Vec<PathBuf> = Default::default();
for ... {
//...
let file_names = cache.entry(dir).or_insert_with(|| std::fs::read_dir(dir));
file_names.retain(|name| {
let remove name.starts_with(prefix) && name.ends_with(suffix);
if remove {
to_remove.push(dir.join(name));
}
!remove
});
}
for path in to_remove {
self.rm_rf(path)?;
}
(psuedo code, no preference on the exact details including whether cache tracks file names or paths, likely it should be wrapped in a struct for easier use)
Benefits
- Could then apply to the new layout as well. While the new layout benefits is already fast, we've been fixing bugs where we've been doing more globbing (including one still pending, #16335).
- More flexibility for deciding what gets removed rather than having to fit within any specific rule system of prefix+suffix, globs, etc.
- Guaranteed that each directory is only read once
This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.
Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.
This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.
Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.
New version brings Zed's clean time from 3 seconds or so to 1.7s. Nice. :)