Distributed.jl

Setup worker-worker connections lazily

Open amitmurthy opened this issue 8 years ago • 7 comments

The default all_to_all topology connects every process to every other process. While this is fine for small clusters, the total number of TCP connections grows quadratically: N processes need N(N-1)/2 connections, roughly (N^2)/2.
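To make the growth concrete, here is a small Python sketch (a back-of-the-envelope model, not Julia and not part of Distributed.jl) that counts connections for N processes, master included. The function names are made up for illustration.

```python
def all_to_all_connections(n_procs: int) -> int:
    """Connections when every process is linked to every other process."""
    return n_procs * (n_procs - 1) // 2


def master_only_connections(n_procs: int) -> int:
    """Connections when only the master is linked to each worker."""
    return max(n_procs - 1, 0)


if __name__ == "__main__":
    # Compare the two counts for a few cluster sizes.
    for n in (4, 100, 1000):
        print(n, all_to_all_connections(n), master_only_connections(n))
```

For 1,000 processes this gives 499,500 connections for all_to_all versus 999 when only the master connects to the workers.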

Since a large class of parallel problems needs only master-worker connections, we should change the default topology to all_to_all_lazy, in which worker-worker connections are set up only on the first request from one worker to another. We should also introduce another topology, master_routed, which connects only the master to the workers and routes any worker-worker call through the master.

To summarize, implement 2 new topologies:

  1. all_to_all_lazy, in which worker-worker connections are set up lazily and which becomes the default for addprocs, and

  2. master_routed, in which only the master connects to workers and worker-worker messages are routed via the master.
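A toy Python model of how the two proposed topologies might behave is sketched below. It is only an illustration of the idea, not the Distributed.jl implementation: the `ToyCluster` class and its `send` method are hypothetical, and the only assumption borrowed from Julia is that the master has process id 1.

```python
class ToyCluster:
    """Hypothetical model of the two proposed topologies (illustration only)."""

    MASTER = 1  # assumption: the master is process 1, workers are 2..n+1

    def __init__(self, n_workers, topology):
        if topology not in ("all_to_all_lazy", "master_routed"):
            raise ValueError(f"unknown topology: {topology}")
        self.topology = topology
        self.workers = list(range(2, n_workers + 2))
        # Both topologies start with master-worker connections only.
        self.connections = {frozenset((self.MASTER, w)) for w in self.workers}

    def send(self, src, dst):
        """Return the path a message takes, opening a link if the topology allows."""
        if src == dst:
            raise ValueError("src and dst must differ")
        link = frozenset((src, dst))
        if link in self.connections:
            return [src, dst]
        if self.topology == "all_to_all_lazy":
            # First worker-worker request: set up the connection on demand.
            self.connections.add(link)
            return [src, dst]
        # master_routed: never connect workers directly, relay via the master.
        return [src, self.MASTER, dst]


if __name__ == "__main__":
    lazy = ToyCluster(3, "all_to_all_lazy")
    print(lazy.send(2, 3), len(lazy.connections))

    routed = ToyCluster(3, "master_routed")
    print(routed.send(2, 3), len(routed.connections))
```

In this model the lazy topology pays for a worker-worker connection only when two workers first talk to each other, while master_routed keeps the connection count fixed at the number of workers at the cost of an extra hop through the master.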

amitmurthy avatar May 20 '17 06:05 amitmurthy

This would solve major connection time issues on large clusters that we have repeatedly seen.

ViralBShah avatar Jul 18 '17 09:07 ViralBShah

Just wanted to mention that https://github.com/JuliaLang/julia/pull/22588 also seemed to make adding remote workers noticeably faster.

andreasnoack avatar Jul 18 '17 11:07 andreasnoack

I wonder how and why JuliaLang/julia#22588 affected worker startup time. @vtjnash ?

amitmurthy avatar Jul 18 '17 11:07 amitmurthy

@andreasnoack / @ViralBShah care to comment on the interface for lazy connection setup in JuliaLang/julia#22814?

amitmurthy avatar Jul 18 '17 11:07 amitmurthy

Sorry for the noise here. I just did some more systematic timings, and my previous impression must have been based on differences in the connection.

andreasnoack avatar Jul 18 '17 15:07 andreasnoack

Bump – are we still planning on doing this?

StefanKarpinski avatar Aug 30 '17 17:08 StefanKarpinski

bump

bisraelsen avatar Apr 18 '18 15:04 bisraelsen