uv
uv copied to clipboard
More efficient cache-key globbing + support parent paths in globs
Summary
(Related PR: #13438 - would be nice to have it merged as well since it touches on the same globwalker code)
There's a few issues with cache-key globs, which this PR attempts to address:
- As of the current state, parent or absolute paths are not allowed, which is not obvious and is not documented. E.g., cache-key paths of the form
{file = "../dep/**"}will be essentially ignored. - Absolute glob patterns also don't work (funnily enough, there's logic in
globwalkitself that attempts to address it inglobwalk::glob_builder(), which serves as inspiration to some parts of this PR). - The reason for parent paths being ignored is the way globwalker is currently being triggered in
uv-cache-info: the base directory is being walked over completely and each entry is then being matched to one of the provided match patterns. - This may also end up being very inefficient if you have a huge root folder with thousands of files: if your match patterns are
a/b/*.rsanda/c/*.pythen instead of walking over the root directory, you can just walk overa/banda/cand match the relevant patterns there. - Why supporting parent paths may be important to the point of being a blocker: in large codebases with python projects depending on other local non-python projects (e.g. rust crates), cache-keys can be very useful to track dependency on the source code of the latter (e.g.
cache-keys = [{ file = "../../crates/some-dep/**" }]. - TLDR: parent/absolute cache-key globs don't work, glob walk can be slow.
Solution
- In this PR, user-provided glob patterns are first clustered (LCP-style) into pattern groups with longest common path prefix; each of these groups can then be walked over separately.
- Pattern groups do not overlap, so we would never walk over the same directory twice (unless there's symlinks pointing to same folders).
- Paths are not canonicalized nor virtually normalized (which is impossible on Unix without FS access), so the method is symlink-safe (i.e. we don't treat
a/b/..asa) and should work fine with #13438. - Because of LCP logic, the minimal amount of directory space will be traversed to cover all patterns.
- Absolute glob patterns will now work.
- Parent-relative glob patterns will now work.
- Glob walking will be more efficient in some cases.
Possible improvements
- Efficiency can be further greatly improved if we limit max depth for globwalk. Currently, a simple ".toml" will deep-traverse the whole folder. Essentially, max depth can be always set to either N or infinity. If a pattern at a pivot node contains
**, we collect all children nodes from the subtree into the same group and don't limit max depth; otherwise, we set max depth to the length of the glob pattern. This wouldn't change correctness though and can we done separately if needed. - If this is considered important enough, docs can be updated to indicate that parent and absolute globs are supported (and symlinks are resolved, if the relevant PR is algo merged in).
Test Plan
- Glob splitting and clustering tests are included in the PR.
- Relative and absolute glob cache-keys were tested in an actual codebase.