pytest
pytest copied to clipboard
How should package collection work?
I am working on cleaning up our collection code, but the current behavior seems odd and incidental, therefore I'd first like to discuss how it should behave.
pytest and operating system versions: pytest 6.0.2/master on Linux.
Current behavior
Consider the following filesystem tree:
a
├── b
│ ├── c
│ │ ├── d
│ │ │ ├── __init__.py
│ │ │ └── test_d.py
│ │ └── test_c.py
│ ├── __init__.py
│ └── test_b.py
├── b2
│ ├── __init__.py
│ └── test_b2.py
├── __init__.py
└── test_a.py
This has several nested packages, but note that the c level doesn't have an __init__.py (just to make it more interesting).
This results in the following collection tree:
$ pytest --co a/
collected 5 items
<Package a>
<Module test_a.py>
<Function test_a1>
<Package b>
<Module test_b.py>
<Function test_b1>
<Module a/b/c/test_c.py>
<Function test_c1>
<Package d>
<Module test_d.py>
<Function test_d1>
<Package b2>
<Module test_b2.py>
<Function test_b21>
The Packages are all flat, not nested. Namespace packages are not considered Packages. Files not inside packages are collected as standalone Modules.
This however does not reveal the real story. See what happens when --keep-duplicates is used:
$ pytest --co --keep-duplicates a/
collected 11 items
<Package a>
<Module test_a.py>
<Function test_a1>
<Package b>
<Module test_b.py>
<Function test_b1>
<Module test_c.py>
<Function test_c1>
<Package d>
<Module test_d.py>
<Function test_d1>
<Package b2>
<Module test_b2.py>
<Function test_b21>
<Package b>
<Module test_b.py>
<Function test_b1>
<Module test_c.py>
<Function test_c1>
<Package d>
<Module test_d.py>
<Function test_d1>
<Module a/b/c/test_c.py>
<Function test_c1>
<Package d>
<Module test_d.py>
<Function test_d1>
<Package b2>
<Module test_b2.py>
<Function test_b21>
This has all of the previous collectors, but also duplicates which are nested under each Package.
Technical details
For this interested, the code details are as follows:
pytest has two recursive filesystem collectors, Session and Package.
Session.collect() walks the entire trees (of the given command line arguments) recursively in BFS order and creates collectors for Packages and Modules. It has various obscure code to special-case Packages and exclude files belonging directly to the package. The Collectors it creates always have the Session as parent (i.e. flat).
Note that the Collectors themselves are only recursively expanded to Items after the above is finished. (This step is done by Session.genitems() which calls each Collector's own collect() method).
Package.collect() also walks the package's directory recursively. It has some code to try to avoid collecting Modules belonging to sub-packages, but otherwise creates Modules and Packages with itself as parent.
Since the stuff collected by Packages was already collected by the Session (with the exception of a package's own direct files) they are mostly discarded as duplicates unless --keep-duplicates is used.
Question
My question is, what do we want to happen?
Evidently the nesting is not super important, otherwise we would have heard loud complaints by now (though there are some issues about this). But the nesting does have an effect on how package-scope fixtures are applied - reuse from super-package or not?
The duplication seems definitely broken.
And there is a question on how PEP 382 namespace packages fit into this (if it all).
Would be happy for any thoughts!
I'd like to have a more detailed discussion about structuring the collection tree, we accrued a mess that requires some decomposition to begin with
Thanks @bluetech for putting up this detailed description.
Indeed the Package node was not thought through before being included, and I'm at fault for not giving the attention it deserved at the time.
Evidently the nesting is not super important, otherwise we would have heard loud complaints by now (though there are some issues about this). But the nesting does have an effect on how package-scope fixtures are applied - reuse from super-package or not?
I think packages should be nested, just like the other nodes. I'm surprised that no one noticed this inconsistency before (including myself). Then package-scoped fixtures follow the normal rules that other fixtures follow.
The question about namespace packages unfortunately is not that simple. Besides being a pain to detect at runtime (at least that I recall the discussions around that subject), we need to wonder how they would be represented given that in the same namespace package, sub-packages might be located in completely different directories, and the collection tree is directory-based. In other words, in your example if we collect a/b/c as a namespace package, what happens if some other c/foo package exists in a directory outside of the collection root a?
I think however we can postpone what to do with namespace packages to some other time, and focus on natural packages (those with __init__.py) instead for now.
I'd like to have a more detailed discussion about structuring the collection tree, we accrued a mess that requires some decomposition to begin with
Good idea @RonnyPfannschmidt. You want to do that here or somewhere else? I think here is fine IMHO.
Found this issue while trying to figure out why my nested test packages all report as top level packages. I, for one, if I went through the trouble to nest them, I want them to be nested when reported. Is there a way to force that to happen right now for my situation as a workaround? I've tried so many things that I'm at a loss.
Found this issue while trying to figure out why my nested test packages all report as top level packages. I, for one, if I went through the trouble to nest them, I want them to be nested when reported. Is there a way to force that to happen right now for my situation as a workaround? I've tried so many things that I'm at a loss.
Ok, so I wrote a plugin as a workaround for what I needed since I couldn't find anything else. Code is MIT license. PyPI: https://pypi.org/project/pytest-prefer-nested-dup-tests/ github: https://github.com/MarximusMaximus/pytest-prefer-nested-dup-tests
Hopefully it's useful to other people!
Possible Plan
Looked at this again.
The main trouble here stems from the fact that we have two recursive collectors -- Session and Package -- which step on each other toes in unexpected ways, and it's very hard to fix.
I'm increasingly convinced that we shouldn't have any recursive collector. Huh? Well, if you think about it, we already have a concept to handle the recursion for us -- the collection tree. The collectors are already recursively expanded on their own, that is, a Collector can collect other Collectors, which are then collected themselves, etc. until we get only Items (for reference, see Session.genitems, which is literally implemented with simple recursion).
What does it mean in practice? At a high level:
- Introduce a new
Directorycollector node. - Remove recursion (filesystem walking) from
SessionandPackage. - A directory is collected like this: if the directory contains an
__init__.pyfile, create aPackage, otherwise aDirectory. Sessioncollects directories from the collection arguments (paths given on the command line/testpaths).PackageandDirectorycollect like this: scan their directory: if the entry is a file, collect it like today; if it's a directory, collect it like as above.
This looks like this:
<Session>
<Package a>
<Package b>
<Directory c>
<Package d>
<Module test_d.py>
<Function test_d>
<Module test_c.py>
<Function test_c>
<Module test_b.py>
<Function test_b>
<Package b2>
<Module test_b2.py>
<Function test_b2>
<Module test_a.py>
<Function test_a>
Some points:
-
<Directory c>is what I call a "graft" - a non-package directory inside of a package directory. There are two options: nest it under<Package b>, or nest it directly underSessionas an independent sub-tree. This mostly has implications forpackagescope fixtures. -
Seems like in very old versions of
pytestthere existed aDirectorynode, but it was removed in 6dac77433eb139247f94ac07c6caf6784b341a70. It's too hard to try and figure out why unfortunately. -
One thing I need to think about is file collection arguments (i.e.
pytest foo/bar/test_it.py). We need to have collection nodes forfooandfoo/barin this case (be theyDirectorys orPackages), but ignore irrelevant paths (e.g.foo/test_baz.py). This is currently handled bySession, but can't be under the new scheme (I think. Maybe the "matchnodes" mechanism is good enough for this). -
This will have to go in a major release as a breaking change.
I have a very simple POC implementation, but there are a bunch of details to figure out. Interested to hear opinions. I vaguely remember @RonnyPfannschmidt mentioned a Directory node, but I can't find it and might be misremembering.
It's been more than a decade, that directory node existed
It was undone for good reasons, but I think it is reasonable to reintroduce it
It was undone for good reasons
Do you remember what they were? I do suspect there is some issue here; the Directory solution is pretty natural and harmonious with the entire collection tree concept, that I'm sure it's been tried. If there's something fundamental I'm sure I'll run into it eventually but sooner is better than later :)
Pytest prior to the refactoring back then had no concept of separation of test running and test execution
Tests where executed as found
While changing the details, it was simpler to do collect towards if there where only files and session
Suggestion regarding the design above: Directory nodes wrap Packages that are not inside other Packages, such as:
<Session>
<Directory a>
<Package a>
<Package b>
<Directory c>
<Package d>
<Module test_d.py>
<Function test_d>
<Module test_c.py>
<Function test_c>
<Module test_b.py>
<Function test_b>
<Package b2>
<Module test_b2.py>
<Function test_b2>
<Module test_a.py>
<Function test_a>
<Directory z>
<Package z>
...
<Module z>
...
<Directory y>
<Directory y2>
<Module y3> (single explicit file path case: ./y/y2/y3.py)
I feel like this (1) makes it clear that a session can be multiple directories (since it can already be multiple "top level" packages, this is roughly equivalent), and (2) Should(tm) make it easier to track file system scoping b/c it Should(tm) only need to find the closest parent Directory node.
NOTE: The suggestion implies that we would SKIP Directory Nodes for sub-packages and modules BUT include directory nodes for "grafted sub-packages". In other words, (I think) only add a Directory Node if the directory lacks an init.py (grafted case) OR does not have a parent package (top level packages).
@MarximusMaximus right. I'm not entirely sure how it would look like exactly, but that's the basic idea - Directory is a non-package directory, and Package is a yes-package directory (really the __init__.py file); and we mirror the filesystem hierarchy with either Directory or Package nodes.
In case I get hit by a bus, my WIP branch is here: https://github.com/bluetech/pytest/commits/pkg-collect
It seems viable for sure, but very delicate.
Package should not be a Module
This comment was extracted to #11137
pytest_collect_directory hook
After the changes proposed in this issue, all of Session, Package and Directory would need to collect directories. It basically looks like this:
if (path / "__init__.py").is_file:
return Package.from_parent(parent, path=path)
else:
return Directory.from_parent(parent, path=path)
I don't like that this logic is hardcoded, and repeated in 3 places.
I propose to add a new hook, pytest_collect_directory, similar to the existing pytest_collect_file, but for directories. There will be 2 implementations:
- An implementation in
main.pywill createDirectorynodes. - An implementation in
pytest.pywill createPackagenodes, if the directory has an__init__.pyfile.
This has the following benefits:
- Avoids the repetition.
- Decouples the
pythonplugin - nowmaindoesn't need to know aboutPackageanymore, thepythonplugin plugsPackageitself. - Plugins can invent new features involving directories. For example, a
YamlDirectorywhich collects files based on some manifest, or a virtual filesystem, etc.
One question is whether the hook should be firstresult or not. The pytest_collect_file hook is not, which means multiple plugins can generate multiple nodes for a single file. But firstresult=False creates an issue for pytest_collect_directory -- main doesn't want to create a Directory if a Package is created. So I'm going to define pytest_collect_directory as firstresult=True, which means only one plugin creates a node for a directory.
POC
This is currently implemented as part of the bigger change here:
https://github.com/bluetech/pytest/commits/pkg-collect
This branch doesn't pass all tests yet, but I don't foresee issues with pytest_collect_directory in particular.
Amusingly I just realized that pytest_collect_directory already existed up to pytest 6.0, but was removed because it was broken: deprecation doc, deprecation issue, deprecation pr, removal pr. So I propose to bring it back, with a different definition (py.path -> pathlib, returns the node instead of just being called??), but this time it will actually work and be useful :)
After a lot of tinkering with obscure pytest behaviors (keepduplicates, conftest exceptions, eek), I managed to make the branch pass all tests. I'll now be working on preparing it for submission (split commits, update docs, add tests, add changelogs, finding relevant issues etc.). I'll be proposing it for pytest 8.0.
Abstract Directory node, parent of Dir and Package
One change I've made from the initial proposal is to add an abstract nodes.Directory node, which is analogous to nodes.File for files, and renamed the concrete Directory node to Dir to avoid the naming conflict.
Reminder: nodes.File is the parent of python.Module, but also used for non-Python tests, like YamlFile etc.
Since now we have a proper collector hierarchy mirroring the filesystem hierarchy, I've decided to add an abstract nodes.Directory, which is a parent of the concrete python.Package (directories with __init__.py) and main.Dir (directories without __init__.py).
I think it's nice because:
-
Plugins can invent their own
Directorynodes, like they can forFiles. -
It further decouples the python plugin, now can replace explicit checks for
python.Package(where it stands for "a directory") tonodes.Directory. -
We can replace the
packagescope, which is tied topython.Package, with adirectoryscope which works for anyDirectory. Why not allow non-Packagedirectories to have directory-level fixtures? (This will be follow up work if we think it's a good idea). -
I think it would be good to tighten
pytest_collect_directoryreturn type tonodes.Directory, to make clear the expectations from this hook. (Side note: would also be good to makepytest_collect_fileneed to returnnodes.Fileinstead of an arbitraryCollector...).
Order of files vs. directories in collection
Previously, files in a directory were collected before sub-directories in the directory. That is, given a filesystem tree
top/
├── aaa
│ └── test_aaa.py
├── test_a.py
├── test_b
│ └── test_b.py
├── test_c.py
└── zzz
└── test_zzz.py
would collect as
<Module top/test_a.py>
<Function test_it>
<Module top/test_c.py>
<Function test_it>
<Module top/aaa/test_aaa.py>
<Function test_it>
<Module top/test_b/test_b.py>
<Function test_it>
<Module top/zzz/test_zzz.py>
<Function test_it>
After my changes, the order that naturally flows from the code is that files and sub-directories are orderer jointly, that is the tree is
<Dir pytest>
<Dir top>
<Dir aaa>
<Module test_aaa.py>
<Function test_it>
<Module test_a.py>
<Function test_it>
<Dir test_b>
<Module test_b.py>
<Function test_it>
<Module test_c.py>
<Function test_it>
<Dir zzz>
<Module test_zzz.py>
<Function test_it>
For now I am keeping the joint ordering, but let me know if you think ordering files before subdirs is better.
How about leaving the recursive FS Walk to a single utility that's collects directory/package/file nodes and provides the correct parents
The collect function of each node could then returns the items/definitions within
This would restrict Directory collectors to only handle their files. We not give them control over the sub-directories? What's the advantage?
No id actually like to go further and leave the scanning of the files tree in the hand off pytest
I'd strictly want to avoid a situation where multiple collectors recursively have to cooperate to scan all the files
Walking the file tree and mapping directories and files to nodes ought to be handled in a single place
I understand, but why do you want it to be handled in one place? My idea is to give control to plugins (i.e. custom nodes, including our own, e.g. python, which currently requires hardcoding in Session, i.e. is not independent). That's more pytest-y IMO.
Any extension that implements file collection Will have to share implementations of selection with pytest, which implies exposing the apis for correctly handling it,and thrusting plugin to correctly use it
Based on how people implemented things that doesn't seem safe
Any extension that implements file collection Will have to share implementations of selection with pytest, which implies exposing the apis for correctly handling it,and thrusting plugin to correctly use it
Based on how people implemented things that doesn't seem safe
I agree with this concern. I somewhat procrastinated on this comment, but I tried various things, and in the end I don't have a solution that alleviates this concern without making the hook/custom directory collector mostly useless. So instead of holding up pytest 8 for even longer, I decided to submit what I have (PR #11646), which I think is pretty good, and hope that plugins do the right thing. I tried to document the expectations in the Directory documentation.