Set up worker-worker connections lazily
The default all_to_all topology connects all processes to each other. While this is fine for small clusters, the total number of TCP connections grows quadratically, as N(N-1)/2 (roughly (N^2)/2) for N processes.
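To make the growth concrete, here is a minimal sketch that counts connections for a fully connected cluster versus one where only the master connects to each worker. The function names are hypothetical and used only for illustration; N counts all processes, master included.

```python
def all_to_all_connections(n: int) -> int:
    """Connections when every one of n processes connects to every other: n*(n-1)/2."""
    return n * (n - 1) // 2


def master_only_connections(n: int) -> int:
    """Connections when only the master connects to each of the n-1 workers."""
    return max(n - 1, 0)


if __name__ == "__main__":
    for n in (8, 64, 512):
        print(n, all_to_all_connections(n), master_only_connections(n))
```

For 512 processes this gives 130816 pairwise connections against 511 master-worker connections.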
Considering that a large class of parallel problems only needs master-worker connections, we should change the default topology to all_to_all_lazy, where worker-worker connections are set up only on the first request from one worker to another. We should also introduce another topology, master_routed, which only connects the master to the workers and, in case of a worker-worker call, routes the request through the master.
To summarize, implement 2 new topologies:
- all_to_all_lazy, where worker-worker connections are set up lazily; this is the default for addprocs
- master_routed, in which only the master connects to workers and worker-worker messages are routed via the master.
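The difference between the two proposed topologies can be sketched with a toy in-memory model. This is only an illustration of the intended behavior, not the actual implementation: the class ToyCluster, its send method, and the convention that process 1 is the master are all assumptions made for the example.

```python
from typing import List


class ToyCluster:
    """Hypothetical model: process 1 is the master, 2..n_workers+1 are workers."""

    def __init__(self, n_workers: int, topology: str) -> None:
        if topology not in ("all_to_all_lazy", "master_routed"):
            raise ValueError("unknown topology: " + topology)
        self.topology = topology
        self.master = 1
        self.workers = list(range(2, n_workers + 2))
        # In both topologies the master-worker links are established up front.
        self.links = {frozenset((self.master, w)) for w in self.workers}

    def send(self, src: int, dst: int) -> List[int]:
        """Return the list of process ids the message passes through."""
        link = frozenset((src, dst))
        if link in self.links:
            return [src, dst]
        if self.topology == "all_to_all_lazy":
            # First worker-worker request: connect now, then send directly.
            self.links.add(link)
            return [src, dst]
        # master_routed: never connect workers; relay through the master.
        return [src, self.master, dst]


if __name__ == "__main__":
    lazy = ToyCluster(4, "all_to_all_lazy")
    print(lazy.send(2, 3), len(lazy.links))

    routed = ToyCluster(4, "master_routed")
    print(routed.send(2, 3), len(routed.links))
```

In the lazy case the link count grows only for worker pairs that actually talk to each other; in the routed case it stays at one link per worker, at the cost of an extra hop through the master.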
This would solve the major connection-time issues that we have repeatedly seen on large clusters.
Just wanted to mention that it also seemed that https://github.com/JuliaLang/julia/pull/22588 made adding remote workers noticeably faster.
I wonder how and why JuliaLang/julia#22588 affected worker startup time. @vtjnash ?
@andreasnoack / @ViralBShah care to comment on the interface for lazy connection setup in JuliaLang/julia#22814?
Sorry for the noise here. Just did some more systematic timings and my previous impression must have been based on differences in the connection.
Bump – are we still planning on doing this?
bump