pathlib: Path.iterdir() is surprisingly not streaming
Bug report
Surprisingly (contrary to its name, and despite returning a generator), Path.iterdir() does not stream directory entries:
It reads all directory entries into memory before yielding the first entry.
This can cause excessive memory usage when "iterating" over very large directories.
Users would expect .iterdir() to operate in a streaming fashion by default, like UNIX find or readdir(), yielding the results of the underlying system calls (e.g. getdents64() on Linux) as they arrive.
But it does `entries = list(scandir_it)` instead:
https://github.com/python/cpython/blob/c419af9e277bea7dd78f4defefc752fe93b0b8ec/Lib/pathlib/__init__.py#L835-L843
This should be documented, and can hopefully be fixed without too much breakage.
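For illustration, a streaming variant could have roughly this shape (a sketch only, reusing the `_from_dir_entry` helper from the linked code; not a drop-in patch):

```python
def iterdir(self):
    # Sketch: yield each entry as the OS delivers it, instead of
    # materializing the whole directory with list() first.
    root_dir = str(self)
    with os.scandir(root_dir) as scandir_it:
        for entry in scandir_it:
            if root_dir == '.':
                yield self._from_dir_entry(entry, entry.name)
            else:
                yield self._from_dir_entry(entry, entry.path)
```

Note that this keeps the scandir() file descriptor open for the lifetime of the generator, which is the trade-off discussed in the comments below.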
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Related issues
- https://github.com/python/cpython/issues/129871
Linked PRs
- gh-136060
I am not very fond of fixing it, since it is not a mere optimization (it might change the behaviour a lot) and would affect backward compatibility.
As for the doc change, I suggest we mention this, but maybe not as a warning; I don't think it's serious enough for that.
Another drawback of the current .iterdir() is that it is uninterruptible: e.g. Ctrl+C during a large .iterdir() has no effect.
Ctrl+C works well during a proper streaming os.scandir().
I do not know where this difference comes from, as theoretically list() is also just reading from the proper streaming os.scandir(); maybe list() is implemented in a low-level uninterruptible way?
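A minimal comparison (a sketch; /tmp/bigdir stands in for any directory with millions of entries):

```python
import os
from pathlib import Path

big = "/tmp/bigdir"  # placeholder for a directory with millions of entries

# Streams entries one at a time; Ctrl+C is handled between iterations.
with os.scandir(big) as it:
    for entry in it:
        pass

# Internally does list(scandir_it) first; Ctrl+C appears to have no
# effect until the entire directory has been read.
for path in Path(big).iterdir():
    pass
```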
> I am not very fond of fixing it, since it is not a mere optimization (it might change the behaviour a lot) and would affect backward compatibility.
Adding another function with a new name and different semantics might be the better option here.
I think I agree - that line 843 is a bad code smell. It would be better to consume that iterable lazily, irrespective of the ability to interrupt the operation. I'd be interested to know what behaviors rely on that eager consumption of the iterable.
Acknowledged it may be difficult to handle the backward compatibility concerns about this behavior.
In my opinion, it's a bug that a method called iterdir() isn't yielding the items lazily, based on its namesake and documentation. I'd rather see this behavior fixed than simply documented.
I believe this behavior was actually changed in 3.13 with https://github.com/python/cpython/pull/117728. Previously iterdir was lazy via returning a generator. But it was changed to materialize all entries at once. Since I think we should revert the semantics change due to breaking anyio (mentioned in #129871), it seems that we should revert to the pre-3.13 behavior which is what @nh2 expects.
> I think I agree - that line 843 is a bad code smell. It would be better to consume that iterable lazily, irrespective of the ability to interrupt the operation. I'd be interested to know what behaviors rely on that eager consumption of the iterable.
>
> Acknowledged it may be difficult to handle the backward compatibility concerns about this behavior.
>
> In my opinion, it's a bug that a method called `iterdir()` isn't yielding the items lazily, based on its namesake and documentation. I'd rather see this behavior fixed than simply documented.
The old version before #117728 used os.listdir, which also eagerly reads the directory; line 843 maintains the semantics of that. The actual pathlib.Path instances are still created lazily. The main difference w.r.t. laziness is that the old version only called os.listdir when the iterator was actually consumed, while the new implementation always calls os.scandir.
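For reference, the pre-#117728 implementation had roughly this shape (reconstructed from the description above; `_make_child_relpath` is named here as the internal helper, which is an assumption about the old code):

```python
def iterdir(self):
    # os.listdir() itself reads the whole directory eagerly, but it is
    # only called once this generator is first advanced.
    for name in os.listdir(self):
        yield self._make_child_relpath(name)
```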
I guess there's no real good choice here: the current behaviour uses more memory when processing large directories. Lazily processing the result from os.scandir uses more open file descriptors when processing nested filesystem structures (e.g. in a naive reimplementation of os.walk; see the sketch below). The latter can also run out of file descriptors when the result from iterdir is kept alive without being fully consumed.
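To make the file-descriptor concern concrete, a naive lazy walk looks like this (a sketch): each recursion level holds its scandir() file descriptor open until that level is fully consumed, so the number of open FDs grows with the tree depth.

```python
import os

def naive_walk(root):
    # Lazy consumption: this directory's FD stays open across the whole
    # recursive descent into each subdirectory.
    with os.scandir(root) as it:
        for entry in it:
            yield entry.path
            if entry.is_dir(follow_symlinks=False):
                yield from naive_walk(entry.path)
```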
> The old version before #117728 used `os.listdir`, which also eagerly reads the directory
Yes.
> I guess there's no real good choice here: the current behaviour uses more memory when processing large directories. Lazily processing the result from `os.scandir` uses more open file descriptors when processing nested filesystem structures (e.g. in a naive reimplementation of `os.walk`).
@ronaldoussoren Is there really no good choice? It seems simple to me:
If somebody wants to implement an os.walk, there are only two general design alternatives:
a. Keep a directory FD open per depth level
b. Read a directory into memory, saving the FD but spending RAM
For (a) they would use a fixed .iterdir(), and for (b) they would call list() on it to force it to completion (both sketched below).
Like with any other Python iterator. Mixing up these two choices is a mistake that also applies to any other Python iterator that keeps open resources.
So I think that distinction should be in the docs, instead of .iterdir() staying broken for that reason. Or there could be Path.listdir() that does the list() for the user. Or, for backwards compat, .iterdir() could be the one that does the list(), and a new function (iterdir_streaming()) could be the fixed one.
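A sketch of the two alternatives, assuming a hypothetical streaming .iterdir() for (a):

```python
from pathlib import Path

def walk_streaming(root: Path):
    # (a) One directory FD stays open per depth level while descending.
    for p in root.iterdir():  # assumes a fixed, streaming iterdir()
        yield p
        if p.is_dir():
            yield from walk_streaming(p)

def walk_buffered(root: Path):
    # (b) Force the listing to completion first: the FD can be closed
    # before recursing, at the cost of holding all entries in RAM.
    for p in list(root.iterdir()):
        yield p
        if p.is_dir():
            yield from walk_buffered(p)
```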
I want to point out that the current implementation is unnecessarily memory-inefficient as well:
```python
with os.scandir(root_dir) as scandir_it:
    entries = list(scandir_it)
if root_dir == '.':
    return (self._from_dir_entry(e, e.name) for e in entries)
else:
    return (self._from_dir_entry(e, e.path) for e in entries)
```
If for each entry e we yield only the `.name` or `.path` parts anyway, why do we keep the entire entries around? Might as well do:
```python
with os.scandir(root_dir) as scandir_it:
    entries = list(scandir_it)
if root_dir == '.':
    return [self._from_dir_entry(e, e.name) for e in entries]
else:
    return [self._from_dir_entry(e, e.path) for e in entries]
```
This reduces memory usage, because the other fields of each entry are discarded early (entries is freed).
But now .iterdir() doesn't return an iterator any more, just a list.
And that is right: It never makes sense to keep a thing in memory just to yield smaller partial views into that thing and then discard the thing afterwards. That is always a waste of memory; returning the partial view fully materialised is always better. That is, any function of the form
```python
def fun():
    thing = list(...)
    for x in thing:
        yield projective_function(x)
```
does not make sense and would be better written as
```python
def fun():
    return [projective_function(x) for x in list(...)]
```
(The only situation where this does not apply is when the partial views are larger for some reason, e.g. when thing is some packed bit array in C and the yielded values are Python booleans, which are 64x larger.)
So I'd claim the whole concept of the current .iterdir() is flawed, because its "iterator" part is pointless and it should just have been .listdir() instead, returning a list.
This is intentional. scandir() opens a file descriptor, so there are two issues:
- If iteration is interrupted, closing of the file descriptor can be delayed (especially on non-refcount-based Python implementations) and is not guaranteed.
- If it is called recursively, it can open an arbitrary number of file descriptors, and these are a limited resource.
This is why the scandir() iterator is consumed eagerly in glob.iglob(), os.walk(), shutil.rmtree() and Path.iterdir().
See also #66363. We should use similar wording in documentation of glob.iglob() and Path.iterdir().
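The pattern those functions use, simplified (a sketch, not the actual stdlib code):

```python
import os

def remove_tree(path):
    # Consume the scandir() iterator eagerly so the directory FD is
    # closed deterministically before recursing; at most one directory
    # FD is open at a time, regardless of tree depth.
    with os.scandir(path) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            remove_tree(entry.path)
        # ... file/directory removal elided ...
```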
@serhiy-storchaka
> This is intentional.
It should probably be commented somewhere in the code, otherwise it looks like a bug.
Your argument about the easy caller mistake in non-refcount based Python makes sense (there should still be another function that does it in an iterative way though, because otherwise pathlib is less powerful than os.scandir()).
Do you have an opinion on my point above why it does not return a list instead of an iterator, as that should be strictly better in all cases? (https://github.com/python/cpython/issues/136059#issuecomment-3016324428)
> So I'd claim the whole concept of the current `.iterdir()` is flawed, because its "iterator" part is pointless and it should just have been `.listdir()` instead, returning a list.
AFAIK it's done this way because Path objects are somewhat expensive to instantiate (though they're cheaper nowadays). This is the same reason for PurePath.parents being a custom sequence rather than a tuple.