tensorflow-allreduce

Can allreduce (and other multi-GPU communications) work via SLI?

ai-bits opened this issue 7 years ago • 3 comments

For quite some time I've been wondering how to build a "poor man's" 4-GPU-accelerated DL machine at the best possible price / performance ratio.

PCIe is said to be the communications bottleneck. The DGX-1 @ 125k is a no-go, and the DIGITS DevBox hardware is partly outdated and not exactly a bargain at 15k...

Now there will be the GTX 1080 Ti @ half the Titan price... What would be 4-way hardware at a rock-bottom price / performance point? Nvidia says SLI is fast, but not how fast. Any chance of using SLI as the communication link between the GPUs for allreduce?

Thanks G.

ai-bits avatar Mar 01 '17 20:03 ai-bits

For fast allreduce you would want all 4 GPUs on the same PCIe root complex. Such motherboards are not always cheap; a typical 4-GPU motherboard will probably split the GPUs across two root complexes.

We haven't tried SLI; we use OpenMPI as our transport layer. If OpenMPI supports SLI in its Byte Transfer Layer (BTL), then allreduce will automatically work across it.
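To make that concrete: from the application's point of view, allreduce is just an MPI call, and which wire the bytes travel over is decided by Open MPI's runtime component selection. Here is a minimal sketch using mpi4py (my assumption for illustration; it is not this repo's own code):

```python
# Minimal allreduce sketch with mpi4py (illustrative only).
# The program only calls Allreduce; Open MPI picks the transport at runtime
# (shared memory, TCP, CUDA-aware paths, ...), so the code does not change
# when the interconnect does.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a gradient-like buffer.
local = np.full(4, rank, dtype=np.float32)
summed = np.empty_like(local)

# Sum across all ranks; every rank ends up with the same result.
comm.Allreduce(local, summed, op=MPI.SUM)
print(f"rank {rank}: {summed}")
```

Run it with something like `mpirun -np 4 python allreduce_demo.py`; the application never names the interconnect.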

Updating with what I learned from skimming the SLI docs: the communication is handled by the driver so that you can do alternate-frame rendering. So it is really only used for graphics, and only the driver (I am assuming) knows the communication protocol across SLI. I think you are out of luck trying to use SLI for allreduce.

shubho avatar Mar 01 '17 21:03 shubho

Thanks a ton for your take on this. I appreciate it!

I fear you are right that SLI is not general-purpose enough and is only meant for splitting up graphics rendering and physics calculations in games or VR, specifically prepared for that. And I could not get hold of any transfer-speed numbers anywhere.

Upon reading up more: in contrast to their deep learning prowess, Nvidia's consumer docs on SLI are totally outdated (7 series!), e.g. implying you need dual-GPU graphics cards (not available for the 10 series) to get to 4-way SLI. Only a closer look at the motherboard makers reveals that they all ship several SLI bridges with their boards to get to 2-/3-/4-way.

The compromise with PCIe: due to CPU PCIe lane restrictions in consumer hardware, lanes are multiplexed, and four physical x16 PCIe slots mostly end up as four logical x8 ones (or x16/x16, or x16/x8/x8). A rough bandwidth estimate is sketched below.
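To put the x8-versus-x16 question in rough numbers, here is a hedged back-of-envelope sketch (purely illustrative, my assumptions): it uses the standard ring-allreduce bandwidth term of roughly 2(N-1)/N times the buffer size per GPU, and assumes about 7.9 GB/s effective for PCIe 3.0 x8 and 15.8 GB/s for x16. Real throughput will be lower once latency and host-staging copies enter the picture.

```python
# Back-of-envelope ring-allreduce estimate (bandwidth term only; ignores
# latency, protocol overhead, and host-staging copies). Bandwidth numbers
# are assumed effective PCIe 3.0 rates, roughly 0.985 GB/s per lane.
def ring_allreduce_seconds(message_bytes, num_gpus, link_gb_per_s):
    """Each GPU sends/receives roughly 2 * (N - 1) / N of the buffer."""
    bytes_on_wire = 2.0 * (num_gpus - 1) / num_gpus * message_bytes
    return bytes_on_wire / (link_gb_per_s * 1e9)

PCIE3_X8, PCIE3_X16 = 7.9, 15.8           # GB/s, assumed effective rates
grad_bytes = 100e6 * 4                    # e.g. 100M float32 parameters
for name, bw in [("x8", PCIE3_X8), ("x16", PCIE3_X16)]:
    t = ring_allreduce_seconds(grad_bytes, num_gpus=4, link_gb_per_s=bw)
    print(f"PCIe 3.0 {name}: ~{t * 1e3:.0f} ms per 400 MB allreduce")
```

On those assumptions, halving the link width roughly doubles the bandwidth term of each allreduce, which is why the slot wiring matters so much for data-parallel training.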

I guess a perf prognosis with the different factors (transfer speed, latency,...) would be very hard to make, so I'll simply take the plunge in the coming weeks.

Thanks again G.

ai-bits avatar Mar 02 '17 13:03 ai-bits

Your best bet is to buy a motherboard that has the maximum number of GPUs per PCIe root complex and that fits your budget. You have to go through the motherboard specs to figure this out; it is not always spelled out clearly. Tyan or SuperMicro should have something.
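To see what topology a given board actually gives you once it is built, `nvidia-smi topo -m` prints the GPU-to-GPU connection matrix. As a rough reading of the legend (my interpretation): PIX/PXB mean the pair talks over PCIe bridges only, PHB means the traffic goes through the CPU's host bridge, and NODE/SYS mean it crosses links between host bridges or NUMA nodes, which is what you want to avoid for fast allreduce. A tiny illustrative wrapper:

```python
# Illustrative helper: print the GPU interconnect matrix reported by
# nvidia-smi. The interpretation of the legend is sketched above and should
# be checked against the legend nvidia-smi itself prints below the matrix.
import subprocess

def show_gpu_topology():
    result = subprocess.run(["nvidia-smi", "topo", "-m"],
                            capture_output=True, text=True, check=True)
    print(result.stdout)

if __name__ == "__main__":
    show_gpu_topology()
```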

shubho avatar Mar 02 '17 18:03 shubho