Algorithm
Thank you for the great project!
I wanted to ask a couple of related questions.
- I noticed that `hash(item.nodeid)` doesn't give a good distribution on my workloads. E.g. if I have 8 tests, I often end up with one shard getting 6 and another shard getting 2.
- It seems like `item.nodeid` has an absolute path to the test file, so it could be machine-specific (depending on the implementation). This makes it possible for a single test to run twice and, conversely, for a test not to be run at all.
So my question is -- can we use a simple round-robin algorithm? Something like:

```python
from typing import Iterable, Sequence

from _pytest import nodes


def filter_items_by_shard(
    items: Iterable[nodes.Node], shard_id: int, num_shards: int
) -> Sequence[nodes.Node]:
    """Computes `items` that should be tested in `shard_id` out of `num_shards` total shards."""
    return [item for idx, item in enumerate(items) if idx % num_shards == shard_id]
```

Is there any downside to it?
My understanding is that pytest has a deterministic order for the items, so it should be just fine.
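For reference, here is a minimal conftest.py-style sketch of how that filter could be wired into pytest's collection hook. It reuses `filter_items_by_shard` from the snippet above, and the `--num-shards` / `--shard-id` option names are placeholders I'm assuming for illustration, not necessarily what the plugin exposes:

```python
# Minimal sketch: round-robin sharding hooked into pytest collection.
# Option names --num-shards / --shard-id are hypothetical.


def pytest_addoption(parser):
    parser.addoption("--num-shards", type=int, default=1, help="Total number of shards.")
    parser.addoption("--shard-id", type=int, default=0, help="0-based index of this shard.")


def pytest_collection_modifyitems(config, items):
    num_shards = config.getoption("--num-shards")
    shard_id = config.getoption("--shard-id")
    if num_shards <= 1:
        return  # sharding disabled, keep all collected items
    selected = filter_items_by_shard(items, shard_id, num_shards)
    selected_set = set(selected)
    deselected = [item for item in items if item not in selected_set]
    if deselected:
        # Report dropped items so pytest's summary counts them as deselected.
        config.hook.pytest_deselected(items=deselected)
    items[:] = selected
```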
> I wanted to ask a couple of related questions.
>
> - I noticed that `hash(item.nodeid)` doesn't give a good distribution on my workloads. E.g. if I have 8 tests, I often end up with one shard getting 6 and another shard getting 2.
Yep, that's a definite possibility: it's not doing any explicit load balancing, so you need a large number of tests for the shards to come out roughly equal.
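To make that concrete, here's a toy illustration with made-up node IDs, using `hashlib.md5` as a stable stand-in for whatever hash is actually applied:

```python
import hashlib
from collections import Counter

# Made-up node IDs; md5 stands in for the plugin's hash of the nodeid.
nodeids = [f"tests/test_example.py::test_case[{i}]" for i in range(8)]
num_shards = 2


def bucket(nodeid: str, num_shards: int) -> int:
    return int(hashlib.md5(nodeid.encode()).hexdigest(), 16) % num_shards


print(Counter(bucket(n, num_shards) for n in nodeids))
# With only 8 items nothing forces a 4/4 split; lopsided outcomes like 6/2
# are entirely plausible, matching the imbalance described above.
```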
> - It seems like `item.nodeid` has an absolute path to the test file, so it could be machine-specific (depending on the implementation). This makes it possible for a single test to run twice and, conversely, for a test not to be run at all.
I'm not sure I understand the problem here. Surely the absolute path will be the same amongst workers in any given invocation? So it might be sharded in a different way across machines, but we should never have tests running multiple times or not at all.
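Just to illustrate the cross-machine part with hypothetical paths (again using a stable digest as a stand-in for the hash):

```python
import hashlib

num_shards = 4
# Hypothetical node IDs for the same test checked out at different locations.
on_machine_a = "/home/alice/repo/tests/test_foo.py::test_bar"
on_machine_b = "/worker/build/repo/tests/test_foo.py::test_bar"


def bucket(nodeid: str, num_shards: int) -> int:
    return int(hashlib.md5(nodeid.encode()).hexdigest(), 16) % num_shards


print(bucket(on_machine_a, num_shards), bucket(on_machine_b, num_shards))
# The buckets are computed from different strings, so nothing forces them to
# agree. That's harmless when every shard of one run sees the same filesystem,
# but not if shards are computed independently on machines with different paths.
```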
> So my question is -- can we use a simple round-robin algorithm? Something like:
>
> ```python
> def filter_items_by_shard(
>     items: Iterable[nodes.Node], shard_id: int, num_shards: int
> ) -> Sequence[nodes.Node]:
>     """Computes `items` that should be tested in `shard_id` out of `num_shards` total shards."""
>     return [item for idx, item in enumerate(items) if idx % num_shards == shard_id]
> ```
>
> Is there any downside to it?
The main downside I can see is if tests have some alternating pattern which causes expensive tests to consistently get allocated to one worker. E.g. if you have many tests that are parameterized with fast and slow versions and you have two workers, the fast versions will always land on the first worker and the slow versions on the second.
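A quick toy example of that pattern with made-up test names and two shards:

```python
# Alternating fast/slow parameterization (made-up names) under round-robin.
items = [
    "test_op[fast-a]", "test_op[slow-a]",
    "test_op[fast-b]", "test_op[slow-b]",
    "test_op[fast-c]", "test_op[slow-c]",
]
num_shards = 2

for shard_id in range(num_shards):
    picked = [item for idx, item in enumerate(items) if idx % num_shards == shard_id]
    print(shard_id, picked)
# Shard 0 gets only the fast variants and shard 1 only the slow ones, so one
# worker ends up doing nearly all the work.
```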
I expect round-robin would work fine for lots of workloads though. I'd be in favour of adding this as an option but keeping hashing available for users who want it. Feel free to open a PR!
I have a bit of a different use case for sharding: I use it with bazel's test sharding, mainly to prevent bazel and pytest-xdist from both trying to parallelize, which results in N² processes (and thus running out of memory on systems with a high core count).
I have found that in some cases (where there are only twice as many tests as shards) the current algorithm leaves some shards with 0 tests, which results in a failure (since pytest exits with an error if no tests are selected). This only occurs on some developers' systems, because it depends on the hashing of the absolute paths. (And yes, this particular set of tests is slow enough that it is worth having almost one shard per test.)
I'm also afraid that bazel remote execution might run pytest with different absolute paths on different workers, resulting in some tests being missed.
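For what it's worth, here is a rough conftest.py sketch of how bazel's sharding protocol could be wired straight into collection, independent of whatever option names the plugin ends up with; `TEST_TOTAL_SHARDS`, `TEST_SHARD_INDEX` and `TEST_SHARD_STATUS_FILE` are the environment variables bazel sets for sharded test targets:

```python
import os
import pathlib


def pytest_collection_modifyitems(config, items):
    total = int(os.environ.get("TEST_TOTAL_SHARDS", "1"))
    index = int(os.environ.get("TEST_SHARD_INDEX", "0"))
    if total <= 1:
        return  # not running under bazel sharding

    # Touching the status file tells bazel that this target honours sharding.
    status_file = os.environ.get("TEST_SHARD_STATUS_FILE")
    if status_file:
        pathlib.Path(status_file).touch()

    selected = [item for idx, item in enumerate(items) if idx % total == index]
    deselected = [item for idx, item in enumerate(items) if idx % total != index]
    if deselected:
        config.hook.pytest_deselected(items=deselected)
    items[:] = selected
```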
So far I've just applied the suggested change without introducing an option. If you have any suggestion for the option name (and what the default behavior should be), I can create a PR.
Hi @dgrunwald-qt, just saw your comment. Yes, my motivation also came from bazel sharding :) I've been using a forked version with quite a bit of success in our monorepo.