cpython GH-116380: Speed up `glob.glob()` by removing some system calls

Speed up glob.glob() and glob.iglob() by reducing the number of system calls made.

This unifies the implementations of globbing in the glob and pathlib modules.

Depends on

#117589
#117732

Filtered recursive walk

Expanding a recursive ** segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, glob.glob("foo/**/*.py", recursive=True) recursively walks foo/ with os.scandir(), and then filters paths through a regex based on "**/*.py, with no further filesystem access needed.

This solves GH-104269 as a side-effect.

Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

Certain special pattern segments ("", "." and "..") leave the flag unchanged
Literal pattern segments (e.g. foo/bar) set the flag to false
Wildcard pattern segments (e.g. */*.py) set the flag to true (because children are found via os.scandir())
Recursive pattern segments (e.g. **) leave the flag unchanged for the root path, and set it to true for descendants discovered via os.scandir().

If the flag is false at the end, we call lstat() on each path to filter out missing paths.

Minor speed-ups

We:

Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling is_dir().
Use a stack rather than recursion to implement recursive wildcards.
Pre-compile regular expressions and pre-join literal pattern segments.
Convert to/from bytes (a minor use-case) in iglob() rather than supporting bytes throughout. This particularly simplifies the code needed to handle relative bytes paths with dir_fd.
Avoid calling os.path.join(); instead we keep paths in a normalized form and append trailing slashes when needed.
Avoid calling os.path.normcase(); instead we use case-insensitive regex matching.

Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

Support for dir_fd
Support for include_hidden
Support for generating paths relative to root_dir

Results

Speedups via python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)" from CPython source directory on Linux:

pattern	speedup
`Lib/*`	1.67x
`Lib/*/`	1.77x
`Lib/*.py`	1.22x
`Lib/**`	5.3x
`Lib/**/`	1.31x
`Lib/*/`	1.9x
`Lib//`	16.8x
`Lib/*//`	2.22x
`Lib/*/.py`	1.81x
`Lib/**/__init__.py`	1.11x
`Lib/*//*.py`	2.37x
`Lib/*//__init__.py`	1.8x

Issue: gh-116380

Mar 05 '24 23:03 barneygale

Needs a fix for #116377 to land.

Mar 05 '24 23:03 barneygale

Thanks Serhiy! We use os.scandir() if either:

We're expanding a recursive wildcard (we need to distinguish directories in order to recurse)
We're expanding a non-final non-recursive wildcard (we need to select only directories)

If neither of these are true, then we don't need to stat() the children, and so os.listdir() is actually a little faster I think. But I will test this on a few machines to be sure!

edit: to further illustrate what I mean, here's where os.listdir() is used:

	non-recursive part	recursive part
non-terminal part	`os.scandir()`	`os.scandir()`
terminal part	`os.listdir()` <--	`os.scandir()`

Mar 07 '24 12:03 barneygale

The new implementation should be benchmarked with different test cases: deep and wide threes, files and directories domination.

I've been looking into this! The randomfiletree project is helpful - it can repeatedly walk a tree and create child files/folders according to a gaussian distribution, which seems to me like a good approximation for an average "shallow and wide" filesystem structure, including tweaking for file or folder distribution.

It's difficult to produce "deep and narrow" trees this way, as the file/folder probability would need to change with the depth (I think?). I've been considering writing a tree generator that works this way, e.g.:

At depth==0, generate 100 subdirectories
At 0 < depth < 50, generate 1 subdirectory
At depth==50, generate 100 files

... but is that overly arbitrary? Is there a better way? Or do I just need to come up with a bunch of test cases along those lines?

Mar 10 '24 18:03 barneygale

A test of 100 nested directories named "deep" from my Linux machine:

pattern	speedup
`deep/**`	3.86x
`deep/**/`	4.03x
`deep/*/`	4.92x
`deep/*//`	4.93x

Mar 15 '24 01:03 barneygale

It should be possible to implement pathlib.Path.glob() atop this new code, which would provide a tasty speed-up, because intermediate Path objects wouldn't be generated. To do this we'd need to add follow_symlinks and case_sensitive arguments to the internal selector functions.

From there it would be simple to expose follow_symlinks and thereby resolve GH-73661.

Mar 17 '24 16:03 barneygale

Is this too long to review? If so I can try staging it in smaller PRs.

Apr 01 '24 22:04 barneygale

With much faffing, I've managed to unify the pathlib and glob implementations of globbing in this patch. This makes the pathlib implementation faster because it doesn't need to spin up intermediate Path objects. The glob implementation is also much faster, but not quite as fast as in earlier versions of this patch. It all comes out code-negative too!

Apr 06 '24 04:04 barneygale

Marking as a draft as I think this is too large to land. But I have an idea of how to stage it!

Apr 06 '24 16:04 barneygale

I have a plan for landing this incrementally, in a way that should be much more reviewable.

Add a glob._Globber class that implements only pathlib globbing.
- Speeds up pathlib globbing by working with strings internally
Select literal parts with lstat() rather than scandir()
- Speeds up pathlib globbing by reducing syscalls made / work done
Support excluding hidden files internally
- Necessary to support glob.glob()
Support dir_fd internally
- Necessary to support glob.glob()
Switch glob.glob() and iglob() to use _Globber
- See description of this PR for benefits.

First PR is here: #117589

I'll leave this PR open mostly for my own reference :P.

Apr 06 '24 20:04 barneygale

Right, this PR is now about as small as I can make it. It does these things specifically:

Adds support for dir_fd and include_hidden in pathlib._glob.Globber
Makes glob.iglob() (thence glob()), glob0() and glob1() use Globber
- Adds _split_pathname() to explode a path into an anchor and a stack of parts
- Adds _relative_glob() to generate paths that are relative to root_dir
- Deletes private functions that are now unused

I'll update the timings in the PR description momentarily.

May 03 '24 21:05 barneygale

@barneygale Thanks for all the refactoring so far, this version of the PR is indeed much smaller and easier to review. The code looks good to me, although to be honest I have not enough experience to judge whether all the subtle cases are handled correctly.

The globbing is much faster for the relevant cases, but for globbing a non-existent directory the new PR is a bit slower. On my system

python -m timeit -s "from glob import glob" "x=glob('directory_does_not_exist/*', recursive=True, include_hidden=True)"

is 14 usec on main and 16.3 usec with this PR. No reason not to accept this PR, but maybe add it to the benchmarks and investigate later.

I tried to check coverage of the new code and there are a few lines missing coverage. Some missing lines seem fine (I only tested windows, some OSError exceptions), but these stand out:

In pathlib._glob the method open_dir has no coverage for the second and third branch of the if statement
In several methods (e.g. select_recursive or select_wildcards) the code path in if dir_fd is not None: is not reached

May 05 '24 21:05 eendebakpt

Thanks so much for the feedback, that's most helpful!

The globbing is much faster for the relevant cases, but for globbing a non-existent directory the new PR is a bit slower. On my system
python -m timeit -s "from glob import glob" "x=glob('directory_does_not_exist/*', recursive=True, include_hidden=True)"
is 14 usec on main and 16.3 usec with this PR. No reason not to accept this PR, but maybe add it to the benchmarks and investigate later.

Will do. I believe the new code immediately calls os.scandir(), which fails in this case, whereas the old code first calls os.lstat(), which again fails but slightly more quickly. If the directory does exist (which should be the common case), the new code saves an os.lstat().

I tried to check coverage of the new code and there are a few lines missing coverage. Some missing lines seem fine (I only tested windows, some OSError exceptions), but these stand out:

In pathlib._glob the method open_dir has no coverage for the second and third branch of the if statement

In several methods (e.g. select_recursive or select_wildcards) the code path in if dir_fd is not None: is not reached

I think this is because paths relative to directory descriptors aren't supported on Windows, but I'll check!

May 05 '24 22:05 barneygale

Hey @gpshead and @serhiy-storchaka - would you like to review this? If not, I might make an appeal on discuss.python.org in a couple weeks' time. Thanks for the feedback so far, and no worries at all if you'd rather focus your efforts elsewhere!

May 14 '24 18:05 barneygale

Just spotted an issue: previously iglob() returned a generator with a close() method that would os.close() any open file descriptors. I'll get this re-instated.

May 31 '24 20:05 barneygale

@picnixz to address some of your comments on using map() rather than looping and yield: I did this so that calling close() on the iglob(dir_fd=blah) generator causes os.close() to be called on all open file descriptors, which seems to work with a stack of for loops but not map(). But I didn't add a test case - I'll do that now :)

Aug 28 '24 16:08 barneygale

That's... an interesting functionality I wasn't aware of :) If someone could explain to me the reason I'd be happy. Anyway, let's keep your loops.

Aug 28 '24 16:08 picnixz