
Much slower than Horovod on TCP

azuresol opened this issue on Apr 12, 2020 • 14 comments

Describe the bug: I benchmarked the performance of BytePS and Horovod with this script on 4 VMs * 8 V100 GPUs over TCP. It turned out that the throughput I got from BytePS is much lower than Horovod's. I wonder if this is expected?

Horovod (4 worker VMs):

# Message size (MB) / Throughput (MB/s)
{ 64: 628.1127256680329,
  128: 1637.494250997945,
  256: 1606.363227578568,
  512: 2162.496918009337,
  1024: 2210.063743908007}

BytePS (4 worker VMs + 4 server VMs):

{ 64: 188.61641825541528,
  128: 206.0769111278037,
  256: 216.58532511299617,
  512: 216.87162749269987,
  1024: 224.94936582978187}

I also ran BytePS in a single-worker setting (without a server or scheduler) in order to pinpoint the issue, but the result is also lower than expected:

{ 64: 665.3096975939151,
  128: 768.8773438853825,
  256: 787.3222907256899,
  512: 772.3518927654511,
  1024: 768.8109970613676}

Environment (please complete the following information):

  • OS: Debian 9
  • GCC version: GCC 7
  • CUDA and NCCL version: CUDA 10 and NCCL 2.5.6
  • Framework (TF, PyTorch, MXNet): TF 1.15

Additional context: The VM type is n1-highmem-96 on GCP with 100 Gbps networking. The GPUs are connected by NVLink (similar to a DGX-1).

azuresol avatar Apr 12 '20 18:04 azuresol

Your benchmark code sends tensors one by one. When BytePS sees a new tensor, it performs an expensive initialization of that tensor's buffer, so the performance you observed is expected. We recommend using the end-to-end benchmark to compare performance.

Here are some other tuning tips, which should be helpful for TCP: https://github.com/bytedance/byteps/issues/230#issuecomment-611515746.

ymjiang avatar Apr 13 '20 00:04 ymjiang

@azuresol As @ymjiang said, when you send a tensor for the first time, BytePS needs additional time to initialize it on both the worker side and the server side. Horovod does not need this because all-reduce is stateless (while PS is stateful). However, this only affects the first round.

In addition to what @ymjiang suggests, you may also refactor your code to push_pull the same tensors multiple times and measure the time of the second and later rounds, as sketched below.
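
A minimal warm-up sketch of that idea (not the actual script from this issue), assuming the benchmark uses byteps.tensorflow in TF 1.15 graph mode and that push_pull mirrors Horovod's allreduce API; the tensor size and loop count are illustrative:

# Warm-up sketch: run push_pull once before timing, then measure later rounds.
import time
import numpy as np
import tensorflow as tf
import byteps.tensorflow as bps

bps.init()

size_mb = 64
# One float32 tensor of roughly size_mb megabytes.
data = np.random.rand(size_mb * 1024 * 1024 // 4).astype(np.float32)
tensor = tf.Variable(data)
reduced = bps.push_pull(tensor)          # assumed to mirror hvd.allreduce

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(reduced)                    # round 1: pays the one-time buffer init
    rounds = 10
    start = time.time()
    for _ in range(rounds):              # rounds 2+: this is what should be timed
        sess.run(reduced)
    avg = (time.time() - start) / rounds
    if bps.rank() == 0:
        print('~%.1f MB/s' % (size_mb / avg))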

bobzhuyb avatar Apr 13 '20 01:04 bobzhuyb

Thanks, I will try again with the refactored benchmark. Does BytePS need expensive initialization even without servers? I assumed it was equivalent to plain NCCL operations, but I might be misunderstanding.

azuresol avatar Apr 13 '20 02:04 azuresol

A quick update: I added a warm-up round of session.run() before measuring the actual execution. The performance numbers for BytePS improved, but they are still lower than Horovod's. Any ideas?

Horovod:
{ 64: 1776.7572510191353,
  128: 1720.4674884248109,
  256: 2045.2847935221362,
  512: 2121.0158884188995,
  1024: 2309.7685065791807}

BytePS:
{ 64: 1296.940501910737,
  128: 1457.7873321558939,
  256: 1568.298796971928,
  512: 1618.0508589138265,
  1024: 1677.4659052153513}

azuresol avatar Apr 13 '20 03:04 azuresol

@azuresol Can you increase the number of server processes? You can keep using those 4 server VMs, but start more server processes on each VM.

For TCP, a single connection cannot saturate the 100Gbps bandwidth. NCCL establishes multiple rings to overcome that; correspondingly, you need more server instances for BytePS.
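
For example, with dist_launcher.py you can list each server VM more than once in the server-hostfile to start that many server processes on it. A hypothetical hostfile is shown below; the IPs are placeholders, and the exact line format should match whatever your current hostfile uses. Listing the 4 server VMs twice each gives 8 server processes:

10.0.0.11
10.0.0.11
10.0.0.12
10.0.0.12
10.0.0.13
10.0.0.13
10.0.0.14
10.0.0.14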

Anyway, running over TCP on a 100Gbps network does not make too much sense, though it is a shame that some cloud providers do not offer RDMA on their high-speed networks. BytePS support for AWS EFA is under development, and we are also looking at ways to optimize BytePS over TCP. That may take some time, though.

bobzhuyb avatar Apr 13 '20 03:04 bobzhuyb

Hi @bobzhuyb, I agree that RDMA would offer better performance, but it is still interesting to find out why BytePS underperforms on TCP -- IIUC, given the same bus bandwidth (even if TCP caps it below 100Gbps), the theoretical algorithm bandwidth of BytePS should be twice NCCL's, according to BytePS's docs and benchmarks.
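
As a back-of-the-envelope check of that claim (a sketch with illustrative numbers, not output from this benchmark), comparing bytes sent per worker NIC per direction for one tensor of N bytes, assuming full-duplex links and enough dedicated server bandwidth:

def ring_allreduce_bytes(N, n):
    # Ring all-reduce: each worker sends (and receives) 2*(n-1)/n * N bytes.
    return 2 * (n - 1) / n * N

def push_pull_bytes(N, n):
    # PS push-pull: each worker pushes N bytes and pulls N bytes, i.e. N per direction.
    return N

N, n = 1024, 4                     # 1024 MB tensor, 4 worker VMs as in this issue
print(ring_allreduce_bytes(N, n))  # 1536 MB per direction
print(push_pull_bytes(N, n))       # 1024 MB per direction
# The ratio is 2*(n-1)/n: 1.5x at n=4, approaching 2x as the worker count grows.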

If I remember correctly, each NCCL ring opens multiple threads and TCP connections. Does BytePS do the same, or is there only one TCP connection between each worker-server pair? Anyway, I will try adding more servers and see if things improve. Thanks.

azuresol avatar Apr 13 '20 04:04 azuresol

I used 16 server processes (across 4 VMs) and 4 workers. It turns out that even with more servers, BytePS still underperforms :(

16 servers:
{ 64: 645.2358540538997,
  128: 1114.3372700397051,
  256: 1446.9432939826472,
  512: 1890.3294603886657,
  1024: 2102.646180203336}

8 servers:
{ 64: 1002.1344917894053,
  128: 1226.0040569054331,
  256: 1526.4048331427127,
  512: 1767.0737988043104,
  1024: 1960.293212291152}

@ymjiang: the OMP_WAIT_POLICY=PASSIVE trick does not work for me either.

azuresol avatar Apr 13 '20 04:04 azuresol

@azuresol Your results show that more servers lead to lower throughput? That does not make sense to me. Can you share your latest code?

Between each worker-server pair there is only one TCP connection. Even with more servers, there is still a bottleneck in ZeroMQ, which BytePS inherits from the original ps-lite. ZeroMQ has a serious throughput problem (see https://github.com/zeromq/libzmq/issues/3525, which I asked libzmq about a while ago). We thought we had a workaround, but it later turned out to solve only part of the problem. To make BytePS+TCP really work well beyond 25Gbps, we probably need to re-implement the TCP part without ZeroMQ.

bobzhuyb avatar Apr 13 '20 04:04 bobzhuyb

To be precise, I used 4 server VMs and increased the number of server processes by repeating the hosts in the server-hostfile. My benchmark code is pretty much the same as before. For completeness, here is the command I run:

/opt/conda/bin/python byteps/launcher/dist_launcher.py --worker-hostfile=WORKER_HOSTS --server-hostfile=SERVER_HOSTS --scheduler-ip=10.128.0.7 --scheduler-port=12345 --username=root --env=NVIDIA_VISIBLE_DEVICES:0,1,2,3,4,5,6,7 'echo this is $DMLC_ROLE; /opt/conda/bin/python byteps/launcher/launch.py /opt/conda/bin/python allreduce_bench.py'

I would like to make sure that this is not an issue with my setup. Let me know whether you are able to reproduce this result on a 100Gbps TCP network.

azuresol avatar Apr 13 '20 05:04 azuresol

@azuresol I kind of understand the problem you have with more servers. Can you set BYTEPS_PARTITION_BYTES=1024000, or maybe 512000? This should significantly improve the performance of the 16-server case. The default value of 4096000 is good for 100G RDMA, but not for your case.

That said, even with this, I am afraid that ZeroMQ would still limit BytePS throughput to no more than that of a single-threaded memcpy.

bobzhuyb avatar Apr 13 '20 22:04 bobzhuyb

Setting BYTEPS_PARTITION_BYTES=1024000 did improve the performance with 16 servers. However, 512000 led to worse results. What is the rationale for tuning BYTEPS_PARTITION_BYTES?

BYTEPS_PARTITION_BYTES=1024000:
{ 64: 1117.1740602953446,
  128: 1514.1138371096351,
  256: 1697.4191211669086,
  512: 2032.568272083459,
  1024: 2190.6864842206014}

BYTEPS_PARTITION_BYTES=512000:
{ 64: 694.2803987107301,
  128: 1139.2819746015773,
  256: 1525.4919535014435,
  512: 1900.7866386294816,
  1024: 2013.8684039364014}

It is true that the performance gap still exists, to say nothing of the theoretical 2x improvement. What is the best approach to confirm that ZeroMQ is the bottleneck?

azuresol avatar Apr 14 '20 04:04 azuresol

BYTEPS_PARTITION_BYTES defines how BytePS partitions large tensors. Smaller partition sizes give you better push/pull pipelining, but also hurt network stack performance.
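
For a rough sense of what this tuning changes (illustrative arithmetic based on the explanation above, not measured data):

import math

tensor_bytes = 1024 * 1024 * 1024              # the 1024 MB message from the benchmark
for part_bytes in (4096000, 1024000, 512000):  # default, then the two values tried above
    chunks = math.ceil(tensor_bytes / part_bytes)
    print('BYTEPS_PARTITION_BYTES=%d -> ~%d partitions' % (part_bytes, chunks))
# More, smaller partitions improve push/pull pipelining and spread load across
# servers, but each partition adds per-message overhead in the ZeroMQ/TCP stack,
# so there is a sweet spot (here 1024000 beat both 4096000 and 512000).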

If you really want to see the performance improvement on GCP, you can use VMs with 25Gbps or even 10Gbps networks. You can also use iftop to confirm that BytePS indeed saves bandwidth compared with all-reduce.

If your target scenario is only 100Gbps TCP, you may have to wait for a few weeks until we re-implement the TCP part.

bobzhuyb avatar Apr 14 '20 05:04 bobzhuyb

Hi @bobzhuyb: What is the typical throughput for the test_benchmark test on 100Gbps RDMA and TCP networks (or your typical setup)? I just want to understand how to interpret my own results. Thanks.

azuresol avatar Apr 29 '20 03:04 azuresol

@azuresol If you have a 100G RDMA network, test_benchmark should get you >85Gbps application throughput. For TCP, it should be similar to your single-connection iperf performance.

We do have something coming very soon: https://github.com/bytedance/ps-lite/pull/31. Hopefully we can get better performance with TCP. Stay tuned.

bobzhuyb avatar Apr 29 '20 04:04 bobzhuyb