pytest-xdist

1184.feature - Add 'singlecollect' distribution mode

Open zorhay opened this issue 9 months ago • 5 comments

This adds a new 'singlecollect' distribution mode that only collects tests on the first worker node and skips redundant collection on other nodes. This can significantly improve startup time for large test suites with expensive collection.

Key features:

  • Only the first worker performs test collection
  • Other workers skip collection verification entirely
  • Tests are distributed using the same algorithm as 'load' mode
  • Handles worker failures gracefully, including the collecting worker
  • Solves issues with floating parameters in pytest collection

zorhay avatar Mar 03 '25 20:03 zorhay

This adds a new 'singlecollect' distribution mode that only collects tests on the first worker node and skips redundant collection on other nodes.

You say this skips redundant collection on other nodes, but I don't think that's right -- how can a node execute a test (Item) it hasn't collected? Can you be more precise about what is being saved here? Is it just the _check_nodes_have_same_collection check?

bluetech avatar Mar 04 '25 14:03 bluetech

@bluetech When using pytest's parametrization with unordered collections like sets, each worker in pytest-xdist independently collects tests. Since sets don't maintain a consistent iteration order, this results in different workers seeing different test orderings.

For example:

import pytest

@pytest.mark.parametrize("value", {1, 2, 3, 4, 5})  # Using a set
def test_example(value):
    assert value > 0

Worker 1 might collect: test_example[1], test_example[3], test_example[2]...

Worker 2 might collect: test_example[2], test_example[1], test_example[5]...

This inconsistency triggers pytest-xdist's collection verification, which detects the mismatch between workers and aborts the test run with an error message like:

Different tests were collected between gw0 and gw1

This forces developers to either avoid using unordered collections in parametrization or manually convert them to ordered sequences.
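The ordering problem and the usual workaround can be reproduced without pytest. The sketch below is only an illustration and is not part of the PR: it starts fresh interpreters with different PYTHONHASHSEED values to stand in for independent workers. The helper name order_in_fresh_process and the parameter values are made up for the example. It uses strings because CPython salts str hashes per process, whereas small int hashes are not salted, so a set of small ints typically iterates in the same order in every process.

```python
import os
import subprocess
import sys

# Hypothetical parameter values, written as a set literal to be evaluated
# in a separate interpreter.
VALUES = "{'alpha', 'beta', 'gamma', 'delta', 'epsilon'}"


def order_in_fresh_process(expression: str, seed: int) -> str:
    """Evaluate `expression` in a new interpreter and return its items,
    comma-joined, in iteration order."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = f"print(','.join({expression}))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


seeds = range(1, 6)
# Each seed plays the role of one worker process with its own hash salt.
raw_orders = {order_in_fresh_process(VALUES, s) for s in seeds}
sorted_orders = {order_in_fresh_process(f"sorted({VALUES})", s) for s in seeds}

print("distinct orders when iterating the set directly:", len(raw_orders))
print("distinct orders after sorted():", len(sorted_orders))
```

Sorting always yields a single order, which is why wrapping the set in sorted(...) before passing it to pytest.mark.parametrize is the common manual workaround.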

The workers still need to have the test items to execute them. What's happening in the singlecollect mode is:

  1. The first worker performs the full pytest collection process (discovering test files, importing modules, creating test items)
  2. The master node receives this collection and distributes the test items to all workers
  3. Other workers do not perform their own independent collection
  4. The collection verification step (_check_nodes_have_same_collection) is skipped

The key part in the code is:

def add_node_collection(self, node: WorkerController, collection: Sequence[str]) -> None:
    """Only use collection from the first node."""
    # We only care about collection from the first node
    if node == self.first_node:
        self.log(f"Received collection from first node {node.gateway.id}")
        self.collection = list(collection)
        self.collection_done = True
    else:
        # Skip collection verification for other nodes
        self.log(f"Ignoring collection from node {node.gateway.id}")

Other nodes receive tests from the master node during test distribution.

@RonnyPfannschmidt A more problem-focused name could be something like unordered-params. I chose singlecollect based on the mechanism of test collection.

zorhay avatar Mar 15 '25 00:03 zorhay

If the only difference here is that differing test order on other nodes is ignored, then it is a completely unacceptable no-go, as schedulers currently talk in terms of indexes into the collection.
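A toy model may make the concern concrete. It assumes each worker resolves a scheduled index against its own locally collected list; whether that holds under the proposed mode is exactly what is being discussed here. The collections and the resolve helper are hypothetical.

```python
# Hypothetical collections: the same three tests, ordered differently per worker.
gw0_collection = ["test_example[1]", "test_example[3]", "test_example[2]"]
gw1_collection = ["test_example[2]", "test_example[1]", "test_example[3]"]


def resolve(collection: list[str], indices: list[int]) -> list[str]:
    """Mimic a worker turning scheduled indices into tests from its own list."""
    return [collection[i] for i in indices]


# The scheduler plans against gw0's list: index 0 goes to gw0, indices 1 and 2 to gw1.
planned = resolve(gw0_collection, [0]) + resolve(gw0_collection, [1, 2])

# What actually runs if gw1 looks the indices up in its own, differently ordered list.
executed = resolve(gw0_collection, [0]) + resolve(gw1_collection, [1, 2])

print("planned: ", sorted(planned))
print("executed:", sorted(executed))
```

In this model test_example[1] runs twice and test_example[2] never runs, without any error being raised, which is the kind of silent mismatch the collection check guards against.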

RonnyPfannschmidt avatar Mar 15 '25 07:03 RonnyPfannschmidt

I understand your concern about schedulers using collection indexes. May I ask what the specific benefit is of having each worker collect tests independently? This PR maintains index-based scheduling while using a single collection source. Workers still receive and execute tests by index, just using the first worker's collection as the reference. Would a mode with a single collector node be a viable solution? It would solve the unordered parameter issue while preserving the existing scheduling mechanism.

zorhay avatar Mar 15 '25 09:03 zorhay

There are a number of edge cases to be aware of:

  • nodeid is not as unique as one might think; there are numerous use cases where one will actually see duplicate node ids
  • sometimes nodeids differ between nodes due to mistakes in the test definition

These come to mind offhand.
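The first edge case matters for any scheme that matches tests across workers by node id instead of by position. A minimal sketch, using a made-up collection in which two distinct items share a node id:

```python
# Hypothetical collection: three items, two of which share the same node id.
collection = [
    "test_a.py::test_x",
    "test_a.py::test_dup",
    "test_a.py::test_dup",
]

# A nodeid -> index mapping cannot represent both duplicates:
# the later entry overwrites the earlier one.
index_by_nodeid = {nodeid: i for i, nodeid in enumerate(collection)}

# Looking a node id up by value only ever finds the first occurrence.
first_match = collection.index("test_a.py::test_dup")

print(len(collection), "items,", len(index_by_nodeid), "distinct node ids")
print("dict maps the duplicate to index", index_by_nodeid["test_a.py::test_dup"])
print("list.index finds it at index", first_match)
```

Either lookup silently drops one of the two items, so node ids alone cannot be treated as a unique key for scheduling.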

RonnyPfannschmidt avatar Mar 15 '25 12:03 RonnyPfannschmidt