tensorpipe
Add a Herring-like benchmark
Summary: This diff adds two benchmarks that perform an allreduce operation using, internally, a technique inspired by Amazon's Herring paper. The two versions, one for TCP and one for the InfiniBand interconnect, operate similarly. They spawn a set of clients (one per device, with multiple devices per machine and multiple machines); each allreduce call then proceeds in four steps, as sketched below:

1. a NCCL reduce or reduce_scatter step within each machine;
2. a sharded send to a set of servers (one per machine, or one per device);
3. aggregation on those servers, which send the result back to the clients;
4. a final NCCL broadcast or all_gather step on the clients.

Multiple allreduce calls, one per "bucket", are launched in parallel during each "epoch". There is also a baseline script that performs the same allreduces using plain NCCL.
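A minimal sketch of the per-bucket flow, for illustration only (this is not the code in this diff). It assumes torch.distributed is initialized, that `local_group` is an NCCL process group spanning one machine's devices, and that `bucket` is a 1-D CUDA tensor divisible by the local world size; `exchange_with_server` is a hypothetical placeholder for the TensorPipe TCP/InfiniBand hop:

```python
import torch
import torch.distributed as dist


def exchange_with_server(shard: torch.Tensor) -> torch.Tensor:
    # Placeholder: the real benchmark sends `shard` to the server that owns
    # it, which sums the matching shards from every machine and sends the
    # total back. Echoing the shard emulates the single-machine case.
    return shard


def herring_like_allreduce(bucket: torch.Tensor, local_group) -> torch.Tensor:
    world_local = dist.get_world_size(group=local_group)

    # Step 1: NCCL reduce_scatter within the machine, so each local rank
    # ends up owning one shard of the machine-wide sum.
    my_shard = torch.empty_like(bucket.chunk(world_local)[0])
    dist.reduce_scatter(my_shard, list(bucket.chunk(world_local)),
                        group=local_group)

    # Steps 2-3: sharded send to the servers, aggregation there, result back.
    aggregated = exchange_with_server(my_shard)

    # Step 4: NCCL all_gather within the machine, so every local rank gets
    # the fully reduced bucket. One such call is launched per "bucket", in
    # parallel, during each "epoch".
    result = torch.empty_like(bucket)
    dist.all_gather(list(result.chunk(world_local)), aggregated,
                    group=local_group)
    return result
```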
Since the benchmarks depend on PyTorch, they are a bit awkward to build and install from within the TensorPipe repo. So, instead of integrating them into our CMake system, I opted to build them as a separate PyTorch extension, giving them their own setup.py file (a sketch of which is shown below). The installation steps are thus to first install TensorPipe on its own, then NCCL, then PyTorch, and finally the benchmarks.
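For reference, a setup.py for a PyTorch C++ extension typically follows the pattern below; the package name, source file, and library list here are illustrative placeholders, not the actual contents of this diff:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="tensorpipe_benchmarks",              # hypothetical package name
    ext_modules=[
        CppExtension(
            name="tensorpipe_benchmarks",      # hypothetical module name
            sources=["benchmark.cc"],          # hypothetical source file
            # Link against the separately installed dependencies.
            libraries=["tensorpipe", "nccl"],
        ),
    ],
    # BuildExtension adds the PyTorch include paths and compiler flags.
    cmdclass={"build_ext": BuildExtension},
)
```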
Differential Revision: D30220559