Distributed.jl: Set up worker-worker connections lazily

The default all_to_all topology connects every process to every other process. This is fine for small clusters, but the total number of TCP connections grows quadratically with the number of processes N: it is N(N-1)/2, or roughly (N^2)/2.
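To make the growth concrete, here is a minimal sketch comparing the connection count of a fully connected cluster with one where only the master connects to each worker. The function names are made up for illustration and are not part of Distributed.jl.

```julia
# Hypothetical helpers for illustration only; not part of Distributed.jl.
# n counts all processes, i.e. the master plus the workers.

# Fully connected: one TCP connection per pair of processes.
all_to_all_connections(n::Integer) = (n * (n - 1)) ÷ 2

# Master-only: one TCP connection from the master to each worker.
master_only_connections(n::Integer) = n - 1

for n in (4, 64, 1024)
    println(n, " processes: all_to_all = ", all_to_all_connections(n),
            ", master-only = ", master_only_connections(n))
end
```

For 1024 processes this gives 523776 connections when fully connected, against 1023 when only the master is connected.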

Considering that a large class of parallel problems only needs master-worker connections, we should change the default topology to all_to_all_lazy, where worker-worker connections are set up only on the first request from one worker to another. We should also introduce another topology, master_routed, which connects only the master to the workers and routes any worker-worker call through the master.

To summarize, implement 2 new topologies:

all_to_all_lazy, where worker-worker connections are set up lazily; this becomes the default for addprocs.
master_routed, in which only the master connects to workers and worker-worker messages are routed via the master.
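
The difference between the two proposed topologies can be sketched with a toy model. This is not how Distributed.jl would implement them: it opens no sockets, and every name below (`MASTER`, `link`, `initial_links`, `send_lazy!`, `send_routed`) is hypothetical. It only tracks which connections exist and which path a message would take.

```julia
# Toy model only: no real sockets, and none of these names exist in Distributed.jl.
const MASTER = 1

# A connection is stored as an ordered pair (lo, hi) of process ids.
link(a::Int, b::Int) = (min(a, b), max(a, b))

# Both proposed topologies start with master-worker connections only.
initial_links(nprocs::Int) =
    Set{Tuple{Int,Int}}(link(MASTER, w) for w in 2:nprocs)

# all_to_all_lazy: open the direct connection on first use, then reuse it.
function send_lazy!(links::Set{Tuple{Int,Int}}, from::Int, to::Int)
    push!(links, link(from, to))   # no-op if the connection already exists
    return [from, to]              # path the message takes
end

# master_routed: never open worker-worker connections; relay via the master.
function send_routed(from::Int, to::Int)
    (from == MASTER || to == MASTER) && return [from, to]
    return [from, MASTER, to]
end

links = initial_links(5)         # master plus 4 workers: 4 connections
send_lazy!(links, 2, 3)          # first 2-3 message opens a direct connection
send_lazy!(links, 3, 2)          # second message reuses it
println(length(links))           # 4 master-worker links plus 1 worker-worker link
println(send_routed(2, 3))       # path goes through the master
```

In this model the lazy topology pays the connection cost only for worker pairs that actually talk to each other, while the routed topology keeps the connection count fixed at the cost of an extra hop through the master.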

May 20 '17 06:05 amitmurthy

This would solve major connection time issues on large clusters that we have repeatedly seen.

Jul 18 '17 09:07 ViralBShah

Just wanted to mention that https://github.com/JuliaLang/julia/pull/22588 also seemed to make adding remote workers noticeably faster.

Jul 18 '17 11:07 andreasnoack

I wonder how and why JuliaLang/julia#22588 affected worker startup time. @vtjnash ?

Jul 18 '17 11:07 amitmurthy

@andreasnoack / @ViralBShah care to comment on the interface for lazy connection setup in JuliaLang/julia#22814?

Jul 18 '17 11:07 amitmurthy

Sorry for the noise here. Just did some more systematic timings and my previous impression must have been based on differences in the connection.

Jul 18 '17 15:07 andreasnoack

Bump – are we still planning on doing this?

Aug 30 '17 17:08 StefanKarpinski

bump

Apr 18 '18 15:04 bisraelsen