pytest-xdist
pytest-xdist copied to clipboard
1184.feature - Add 'singlecollect' distribution mode
This adds a new 'singlecollect' distribution mode that only collects tests on the first worker node and skips redundant collection on other nodes. This can significantly improve startup time for large test suites with expensive collection.
Key features:
- Only the first worker performs test collection
- Other workers skip collection verification entirely
- Tests are distributed using the same algorithm as 'load' mode
- Handles worker failures gracefully, including the collecting worker
- Solves issues with floating parameters in pytest collection
This adds a new 'singlecollect' distribution mode that only collects tests on the first worker node and skips redundant collection on other nodes.
You say this skips redundant collection on other nodes, but I don't think that's right -- how can a node execute a test (Item) it hasn't collected? Can you be more precise what is being saved here? Is it just the _check_nodes_have_same_collection?
@bluetech When using pytest's parametrization with unordered collections like sets, each worker in pytest-xdist independently collects tests. Since sets don't maintain a consistent iteration order, this results in different workers seeing different test orderings.
For example:
@pytest.mark.parametrize("value", {1, 2, 3, 4, 5}) # Using a set
def test_example(value):
assert value > 0
Worker 1 might collect: test_example[1], test_example[3], test_example[2]... Worker 2 might collect: test_example[2], test_example[1], test_example[5]... This inconsistency triggers pytest-xdist's collection verification, which detects the mismatch between workers and aborts the test run with an error message like:
Different tests were collected between gw0 and gw1
This forces developers to either avoid using unordered collections in parametrization or manually convert them to ordered sequences.
The workers still need to have the test items to execute them. What's happening in the singlecollect mode is:
- The first worker performs the full pytest collection process (discovering test files, importing modules, creating test items)
- The master node receives this collection and distributes the test items to all workers
- Other workers do not perform their own independent collection
- The collection verification step (_check_nodes_have_same_collection) is skipped
The key part in the code is:
def add_node_collection(self, node: WorkerController, collection: Sequence[str]) -> None:
"""Only use collection from the first node."""
# We only care about collection from the first node
if node == self.first_node:
self.log(f"Received collection from first node {node.gateway.id}")
self.collection = list(collection)
self.collection_done = True
else:
# Skip collection verification for other nodes
self.log(f"Ignoring collection from node {node.gateway.id}")
Other nodes receive tests from the master node during test distribution.
@RonnyPfannschmidt More problem-focused naming can be something like unordered-params. I chose singlecollect based on the mechanism of tests collection.
If the only difference for this is ignoring different test order on other nodes then it is a completely unacceptable no go as shedulers currently talk in terms of indexes into the collection
I understand your concern about schedulers using collection indexes. May I ask what the specific benefit is of having each worker collect tests independently? This PR maintains index-based scheduling while using a single collection source. Workers still receive and execute tests by index, just using the first worker's collection as the reference. Would a mode with a single collector node be a viable solution? It would solve the unordered parameter issue while preserving the existing scheduling mechanism.
Theres a number of edge cases to be aware of
- nodeid is not unique as one might think theres numerous usecases where one will actually see duplicate node ids
- sometimes nodeids differ between nodes due to misstakes in test definition
Come to mind off hand