
total worker initialization time scales linearly with the number of workers

dwiel opened this issue 7 years ago • 15 comments

It seems that initializing workers is currently blocking and done sequentially:

    def setup_nodes(self, putevent):
        self.config.hook.pytest_xdist_setupnodes(config=self.config, specs=self.specs)
        self.trace("setting up nodes")
        # A naive attempt at parallel setup (left commented out); it fails
        # because the execnet objects are not pickleable:
        # from multiprocessing import Pool
        # p = Pool(len(self.specs))
        # nodes = p.map(lambda spec: self.setup_node(spec, putevent), self.specs)
        nodes = []
        for spec in self.specs:
            # Each setup_node call blocks until that worker is up, so total
            # startup time grows linearly with the number of specs.
            nodes.append(self.setup_node(spec, putevent))
        return nodes

This means that starting a large number of workers takes a long time, even on a single machine with many CPU cores. For example, it takes about 45 seconds to start one worker per CPU core on my 88-core machine. The effect is even more extreme when workers are started on remote machines, where network latency adds to each setup.

With the advent of machines with more cores and more frequent access to large clusters, it would be nice to scale horizontally to a large number of workers quickly, even in cases where that is the difference between 5 minutes on 8 workers and 8 seconds on 300 workers. Obviously it isn't quite that simple, but there is clearly theoretical room for improvement.

As you can see from the code posted above, I've tried using multiprocessing to parallelize the setup of new nodes; however, the execnet objects used aren't pickleable, so this naïve solution does not work.
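
Something like this minimal snippet reproduces the pickling failure (a sketch, assuming a plain local popen gateway):

    import pickle

    import execnet

    # A gateway wraps a live subprocess and its I/O channels,
    # neither of which the pickle protocol can serialize.
    gw = execnet.makegateway("popen")
    try:
        pickle.dumps(gw)
    except Exception as exc:
        print(f"cannot pickle gateway: {exc!r}")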

I've also spent a little time investigating something like ray for rapid distribution across an admittedly homogeneous cluster, but I ran into trouble there too: Function objects were not pickleable even by dill and cloudpickle.

Has anyone looked into how else this problem could be solved? Perhaps my identification of the problem is incorrect. Are there other critical factors preventing the use of a large number of workers?

dwiel avatar Sep 27 '18 20:09 dwiel

This needs either a fix in execnet, or work on supporting multiprocessing/mitogen as a backend.

RonnyPfannschmidt avatar Sep 27 '18 20:09 RonnyPfannschmidt

What do you think about the work involved in supporting alternative backends? Everything seems fairly tightly coupled to execnet right now, though perhaps not as much as I think.

dwiel avatar Sep 27 '18 20:09 dwiel

I didn't even start on the initial analysis, but I did decide to stop working on execnet myself.

RonnyPfannschmidt avatar Sep 27 '18 20:09 RonnyPfannschmidt

any progress on this?

programmerjake avatar Jul 03 '19 09:07 programmerjake

nope

RonnyPfannschmidt avatar Jul 03 '19 10:07 RonnyPfannschmidt

This seems like it would be amenable to a concurrent.futures thread pool for the setup, assuming the underlying setup_node is thread-safe?
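
For example (a sketch only, assuming setup_node and the hooks it calls tolerate concurrent calls, which is unverified):

    from concurrent.futures import ThreadPoolExecutor

    def setup_nodes(self, putevent):
        self.config.hook.pytest_xdist_setupnodes(config=self.config, specs=self.specs)
        self.trace("setting up nodes")
        # Threads sidestep the pickling problem: the execnet objects never
        # leave the process, and only the blocking setup calls overlap.
        with ThreadPoolExecutor(max_workers=len(self.specs) or 1) as pool:
            nodes = list(pool.map(lambda spec: self.setup_node(spec, putevent), self.specs))
        return nodes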

kapilt avatar Feb 04 '20 07:02 kapilt

There is currently no analysis on that.

RonnyPfannschmidt avatar Feb 04 '20 08:02 RonnyPfannschmidt

Based on a brief look, however, I would guess that it is absolutely not thread-safe.

RonnyPfannschmidt avatar Feb 04 '20 08:02 RonnyPfannschmidt

Yeah, that's the problem with the current method. A more fundamental change would be required to make it thread-safe.

dwiel avatar Feb 04 '20 15:02 dwiel

That problem makes xdist quite inconvenient to use. I often end up with xdist being slower on a machine with more cores than running the tests serially without it. Any workarounds?

ssbarnea avatar Jul 07 '21 08:07 ssbarnea

No, there's currently no way

RonnyPfannschmidt avatar Jul 07 '21 11:07 RonnyPfannschmidt

Hey gang, is there any planned work on this? It would be amazing if it could yield each worker as soon as it is set up.
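
Purely as a hypothetical sketch (with the same unverified thread-safety caveat discussed above), something like as_completed could hand back each node as its setup finishes:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def setup_nodes_iter(self, putevent):
        # Hypothetical variant of setup_nodes: yield each node as soon
        # as its setup completes instead of waiting for all of them.
        with ThreadPoolExecutor(max_workers=len(self.specs) or 1) as pool:
            futures = [pool.submit(self.setup_node, spec, putevent) for spec in self.specs]
            for future in as_completed(futures):
                yield future.result()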

WittierDinosaur avatar Apr 17 '23 13:04 WittierDinosaur

This needs some work in execnet, which is currently bus-factored on me, and I'm on paternity leave.

RonnyPfannschmidt avatar Apr 17 '23 13:04 RonnyPfannschmidt

Ah, that's very fair. I'm assuming this isn't the kind of thing someone can pick up quickly?

WittierDinosaur avatar Apr 17 '23 16:04 WittierDinosaur

It's probably possible, but it needs some digging into the guts; a hack using a thread pool may be enough.

RonnyPfannschmidt avatar Apr 17 '23 16:04 RonnyPfannschmidt