GH-116380: Speed up `glob.glob()` by removing some system calls
Speed up `glob.glob()` and `glob.iglob()` by reducing the number of system calls made.
This unifies the implementations of globbing in the glob and pathlib modules.
Depends on
- #117589
- #117732
## Filtered recursive walk
Expanding a recursive `**` segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, `glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with `os.scandir()`, and then filters paths through a regex based on `"**/*.py"`, with no further filesystem access needed.
This solves GH-104269 as a side-effect.
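A minimal sketch of this idea (not the actual implementation: `os.walk()` stands in for the `os.scandir()`-based walk, and the regex is hand-written for the `**/*.py` tail):

```python
import os
import re

def filtered_recursive_walk(root, tail_regex):
    """Walk `root` once, then select paths by regex with no extra syscalls.

    `tail_regex` is a hypothetical pre-compiled pattern standing in for
    the regex the globber would build from the segments after `**`.
    """
    pattern = re.compile(tail_regex)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # Match against the path relative to the walk root, with
            # forward slashes, so the regex is platform-independent.
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            if pattern.fullmatch(rel):
                yield path
```

For instance, `r"(?:.+/)?[^/]*\.py"` selects every path whose final segment ends in `.py`, at any depth, from a single tree walk.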
## Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

- Certain special pattern segments (`""`, `"."` and `".."`) leave the flag unchanged
- Literal pattern segments (e.g. `foo/bar`) set the flag to false
- Wildcard pattern segments (e.g. `*/*.py`) set the flag to true (because children are found via `os.scandir()`)
- Recursive pattern segments (e.g. `**`) leave the flag unchanged for the root path, and set it to true for descendants discovered via `os.scandir()`

If the flag is false at the end, we call `lstat()` on each path to filter out missing paths.
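The final filtering step can be sketched like this (illustrative only; the `(path, known_to_exist)` pairs are a hypothetical stand-in for the flag described above):

```python
import os

def filter_missing(candidates):
    """Yield only existing paths; lstat() only when existence is unknown.

    `candidates` is an iterable of (path, known_to_exist) pairs, mirroring
    the existence flag tracked during pattern expansion.
    """
    for path, known_to_exist in candidates:
        if known_to_exist:
            # Flag is true: the path came from os.scandir(), so it
            # exists and no extra syscall is needed.
            yield path
            continue
        try:
            os.lstat(path)
        except OSError:
            continue  # missing (or inaccessible): filter it out
        yield path
```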
## Minor speed-ups

We:

- Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling `is_dir()`.
- Use a stack rather than recursion to implement recursive wildcards.
- Pre-compile regular expressions and pre-join literal pattern segments.
- Convert to/from `bytes` (a minor use-case) in `iglob()` rather than supporting `bytes` throughout. This particularly simplifies the code needed to handle relative bytes paths with `dir_fd`.
- Avoid calling `os.path.join()`; instead we keep paths in a normalized form and append trailing slashes when needed.
- Avoid calling `os.path.normcase()`; instead we use case-insensitive regex matching.
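The last point can be illustrated with a small sketch (not the module's actual code): compile the pattern once with `re.IGNORECASE` instead of normalizing the case of every candidate name.

```python
import fnmatch
import re

def compile_pattern(pat, case_sensitive):
    """Compile a shell pattern once, folding case into the regex flags.

    Replaces per-candidate os.path.normcase() calls with a single
    case-insensitive compile on case-insensitive filesystems.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(fnmatch.translate(pat), flags)
```

With this, `compile_pattern("*.TXT", False)` matches `Notes.txt` directly, with no per-name case normalization.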
## Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

- Support for `dir_fd`
- Support for `include_hidden`
- Support for generating paths relative to `root_dir`
## Results

Speedups via `python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)"` from the CPython source directory on Linux:

| pattern | speedup |
|---|---|
| `Lib/*` | 1.67x |
| `Lib/*/` | 1.77x |
| `Lib/*.py` | 1.22x |
| `Lib/**` | 5.3x |
| `Lib/**/` | 1.31x |
| `Lib/**/*` | 1.9x |
| `Lib/**/**` | 16.8x |
| `Lib/**/*/` | 2.22x |
| `Lib/**/*.py` | 1.81x |
| `Lib/**/__init__.py` | 1.11x |
| `Lib/**/*/*.py` | 2.37x |
| `Lib/**/*/__init__.py` | 1.8x |
- Issue: gh-116380
Needs a fix for #116377 to land.
Thanks Serhiy! We use `os.scandir()` if either:

- We're expanding a recursive wildcard (we need to distinguish directories in order to recurse)
- We're expanding a non-final non-recursive wildcard (we need to select only directories)

If neither of these is true, then we don't need to `stat()` the children, and so `os.listdir()` is actually a little faster, I think. But I will test this on a few machines to be sure!
edit: to further illustrate what I mean, here's where `os.listdir()` is used:

| | non-recursive part | recursive part |
|---|---|---|
| non-terminal part | `os.scandir()` | `os.scandir()` |
| terminal part | `os.listdir()` <-- | `os.scandir()` |
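In code, that choice might look something like this (a sketch of the decision, not the PR's code; names are illustrative):

```python
import os

def list_children(path, need_dir_flags):
    """Return child names, with is_dir info only when it's needed.

    `need_dir_flags` is true for recursive wildcards and for non-terminal
    non-recursive wildcards; for a terminal non-recursive wildcard, plain
    os.listdir() suffices and avoids building DirEntry objects.
    """
    if need_dir_flags:
        with os.scandir(path) as entries:
            return [(e.name, e.is_dir()) for e in entries]
    return [(name, None) for name in os.listdir(path)]
```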
The new implementation should be benchmarked with different test cases: deep and wide trees, and trees dominated by files versus by directories.
I've been looking into this! The randomfiletree project is helpful - it can repeatedly walk a tree and create child files/folders according to a Gaussian distribution, which seems to me like a good approximation of an average "shallow and wide" filesystem structure, and it allows tweaking the file/folder distribution.
It's difficult to produce "deep and narrow" trees this way, as the file/folder probability would need to change with the depth (I think?). I've been considering writing a tree generator that works this way, e.g.:
- At `depth == 0`, generate 100 subdirectories
- At `0 < depth < 50`, generate 1 subdirectory
- At `depth == 50`, generate 100 files
... but is that overly arbitrary? Is there a better way? Or do I just need to come up with a bunch of test cases along those lines?
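For concreteness, one possible reading of that sketch as a fixture generator (all names and counts are illustrative, and the defaults mirror the numbers above):

```python
import os

def make_deep_tree(root, branches=100, depth=50, files=100):
    """Build `branches` chains of `depth` nested directories, each chain
    ending in `files` empty files - a toy "deep and narrow" benchmark tree.
    """
    for b in range(branches):
        path = os.path.join(root, f"branch{b}")
        # One subdirectory per level until the target depth is reached.
        for _ in range(depth - 1):
            path = os.path.join(path, "deep")
        os.makedirs(path, exist_ok=True)
        # Files only appear at the bottom of each chain.
        for i in range(files):
            open(os.path.join(path, f"file{i}.txt"), "w").close()
```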
A test of 100 nested directories named "deep" from my Linux machine:

| pattern | speedup |
|---|---|
| `deep/**` | 3.86x |
| `deep/**/` | 4.03x |
| `deep/**/*` | 4.92x |
| `deep/**/*/` | 4.93x |
It should be possible to implement `pathlib.Path.glob()` atop this new code, which would provide a tasty speed-up, because intermediate `Path` objects wouldn't be generated. To do this we'd need to add `follow_symlinks` and `case_sensitive` arguments to the internal selector functions.
From there it would be simple to expose `follow_symlinks` and thereby resolve GH-73661.
Is this too long to review? If so I can try staging it in smaller PRs.
With much faffing, I've managed to unify the pathlib and glob implementations of globbing in this patch. This makes the pathlib implementation faster because it doesn't need to spin up intermediate Path objects. The glob implementation is also much faster, but not quite as fast as in earlier versions of this patch. It all comes out code-negative too!
Marking as a draft as I think this is too large to land. But I have an idea of how to stage it!
I have a plan for landing this incrementally, in a way that should be much more reviewable.
1. Add a `glob._Globber` class that implements only pathlib globbing.
   - Speeds up pathlib globbing by working with strings internally
2. Select literal parts with `lstat()` rather than `scandir()`
   - Speeds up pathlib globbing by reducing syscalls made / work done
3. Support excluding hidden files internally
   - Necessary to support `glob.glob()`
4. Support `dir_fd` internally
   - Necessary to support `glob.glob()`
5. Switch `glob.glob()` and `iglob()` to use `_Globber`
   - See description of this PR for benefits.
First PR is here: #117589
I'll leave this PR open mostly for my own reference :P.
Right, this PR is now about as small as I can make it. It does these things specifically:

- Adds support for `dir_fd` and `include_hidden` in `pathlib._glob.Globber`
- Makes `glob.iglob()` (thence `glob()`), `glob0()` and `glob1()` use `Globber`
- Adds `_split_pathname()` to explode a path into an anchor and a stack of parts
- Adds `_relative_glob()` to generate paths that are relative to `root_dir`
- Deletes private functions that are now unused
I'll update the timings in the PR description momentarily.
@barneygale Thanks for all the refactoring so far, this version of the PR is indeed much smaller and easier to review. The code looks good to me, although to be honest I don't have enough experience to judge whether all the subtle cases are handled correctly.
The globbing is much faster for the relevant cases, but for globbing a non-existent directory the new PR is a bit slower. On my system

`python -m timeit -s "from glob import glob" "x=glob('directory_does_not_exist/*', recursive=True, include_hidden=True)"`

is 14 usec on main and 16.3 usec with this PR. No reason not to accept this PR, but maybe add it to the benchmarks and investigate later.
I tried to check coverage of the new code and there are a few lines missing coverage. Some missing lines seem fine (I only tested Windows; some `OSError` exceptions), but these stand out:

- In `pathlib._glob` the method `open_dir` has no coverage for the second and third branch of the `if` statement
- In several methods (e.g. `select_recursive` or `select_wildcards`) the code path in `if dir_fd is not None:` is not reached
Thanks so much for the feedback, that's most helpful!
> The globbing is much faster for the relevant cases, but for globbing a non-existent directory the new PR is a bit slower. On my system `python -m timeit -s "from glob import glob" "x=glob('directory_does_not_exist/*', recursive=True, include_hidden=True)"` is 14 usec on main and 16.3 usec with this PR. No reason not to accept this PR, but maybe add it to the benchmarks and investigate later.
Will do. I believe the new code immediately calls `os.scandir()`, which fails in this case, whereas the old code first calls `os.lstat()`, which again fails but slightly more quickly. If the directory does exist (which should be the common case), the new code saves an `os.lstat()`.
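The trade-off can be sketched directly (illustrative only; function names are made up for the comparison):

```python
import os

def list_dir_old_style(path):
    # Old approach: lstat() the parent first, then read it.
    # Fails fast (and slightly cheaper) when the directory is missing,
    # but costs an extra syscall when it exists.
    os.lstat(path)
    return os.listdir(path)

def list_dir_new_style(path):
    # New approach: go straight to scandir(). One syscall saved in the
    # common (existing-directory) case; a slightly costlier failure
    # when the directory doesn't exist.
    with os.scandir(path) as it:
        return [entry.name for entry in it]
```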
> I tried to check coverage of the new code and there are a few lines missing coverage. Some missing lines seem fine (I only tested Windows; some `OSError` exceptions), but these stand out:
>
> - In `pathlib._glob` the method `open_dir` has no coverage for the second and third branch of the `if` statement
> - In several methods (e.g. `select_recursive` or `select_wildcards`) the code path in `if dir_fd is not None:` is not reached
I think this is because paths relative to directory descriptors aren't supported on Windows, but I'll check!
Hey @gpshead and @serhiy-storchaka - would you like to review this? If not, I might make an appeal on discuss.python.org in a couple weeks' time. Thanks for the feedback so far, and no worries at all if you'd rather focus your efforts elsewhere!
Just spotted an issue: previously `iglob()` returned a generator with a `close()` method that would `os.close()` any open file descriptors. I'll get this reinstated.
@picnixz to address some of your comments on using `map()` rather than looping and `yield`: I did this so that calling `close()` on the `iglob(dir_fd=blah)` generator causes `os.close()` to be called on all open file descriptors, which seems to work with a stack of for loops but not `map()`. But I didn't add a test case - I'll do that now :)
That's... an interesting functionality I wasn't aware of :) If someone could explain to me the reason I'd be happy. Anyway, let's keep your loops.
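For reference, the mechanism under discussion: calling `close()` on a generator raises `GeneratorExit` at the paused `yield`, which runs any enclosing `try`/`finally` blocks - so a stack of for loops with a `finally` gives the generator a place to `os.close()` its descriptor, whereas a `map()` pipeline offers no such hook. A minimal sketch:

```python
def yield_with_cleanup(log):
    """Stands in for an iglob() loop that opened a file descriptor."""
    try:
        yield "first"
        yield "second"
    finally:
        # In iglob() this would be os.close(fd); here we just record it.
        log.append("fd closed")

log = []
gen = yield_with_cleanup(log)
next(gen)    # generator is now paused inside the try block
gen.close()  # GeneratorExit is raised at the yield; finally runs
```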