
[Feature Request] Network performance benchmarks

Open · phansel opened this issue 11 months ago · 4 comments

How do users notice network degradation between exo nodes in regular operation - whether that's WiFi, Thunderbolt, Ethernet, or another link? How do they characterize links, other than with simpler tools like iperf3?

A typical tool used for this is NetPIPE: https://netpipe.cs.ksu.edu/ . It uses MPI (launched over SSH via mpirun) to answer the question: with a message size of N bytes, what are the minimum latency and maximum bandwidth between any two given nodes?

Installation

NetPIPE won't compile as-is on macOS (mpicc Apple clang-1600.0.26.4); the makefile must be modified to drop the -lrt flag, since a separate librt doesn't exist on macOS.

 makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/makefile b/makefile
index 0cbcbd5..d092575 100755
--- a/makefile
+++ b/makefile
@@ -34,7 +34,7 @@ CC         = gcc
 ifdef MACOSX_DEPLOYMENT_TARGET
    CFLAGS     = -g -O3 -Wall
 else
-   CFLAGS     = -g -O3 -Wall -lrt
+   CFLAGS     = -g -O3 -Wall
 endif
 SRC        = ./src

On Fedora 41 (mpicc gcc version 14.2.1 20240912), a different patch is required to compile:

diff --git a/src/netpipe.h b/src/netpipe.h
index ed96aa6..5bff8d0 100644
--- a/src/netpipe.h
+++ b/src/netpipe.h
@@ -99,7 +99,7 @@
 
    // return error check variable and macro
 
-int err;
+//int err;
 
 #define ERRCHECK( _ltrue, _format, _args...) do {         \
    if( _ltrue ) {                                   \

np.hosts config

On any system, the self-to-self performance is measured with the following np.hosts file:

0.0.0.0 slots=2

For a two-machine system, you'd use this kind of file:

host_0_ip slots=1
host_1_ip slots=1

Then build the MPI benchmark and run it across both hosts:

make mpi
mpirun -np 2 --hostfile np.hosts NPmpi -o np.mpi.qdr --start 1 --end 65536

This in turn can be plotted with npplot, which depends on gnuplot: https://gitlab.beocat.ksu.edu/PeterGottesman/netpipe-5.x/-/blob/master/npplot?ref_type=heads .
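If gnuplot isn't available, the same output can be plotted with a few lines of Python instead. This is a minimal sketch, assuming the NPmpi output file is whitespace-separated with message size (bytes) in the first column and throughput in the second; check the columns of your own output before trusting the axis labels:

# plot_np.py -- rough stand-in for npplot (assumed column layout:
# message size in bytes, then throughput; verify against your NPmpi output).
import sys
import matplotlib.pyplot as plt

sizes, throughput = [], []
with open(sys.argv[1]) as f:
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            sizes.append(float(fields[0]))
            throughput.append(float(fields[1]))
        except ValueError:
            continue  # skip headers or non-numeric lines

plt.semilogx(sizes, throughput, marker="o")
plt.xlabel("Message size (bytes)")
plt.ylabel("Throughput (units as reported by NetPIPE)")
plt.grid(True, which="both")
plt.savefig("np_bandwidth.png", dpi=150)

Run it as python plot_np.py np.mpi.qdr.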

I created a gist that does everything except installation here: https://gist.github.com/phansel/26677111a61a53c0c3cdbdf94ae1a66e.

A future version of exo could characterize each path in a cluster at runtime and use that to improve resource allocation or report connectivity issues (e.g. degraded cable or connector).
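For illustration, here is a rough sketch of what such a probe could look like over plain TCP; the script name, port, and payload sizes are arbitrary choices for this sketch, and this is not exo's actual networking code:

# probe_path.py -- illustrative only, not exo's real networking code:
# rough latency and bandwidth between two hosts over a plain TCP socket.
import socket
import sys
import time

PORT = 5201                  # arbitrary port chosen for this sketch
PINGS = 1000                 # small round trips for the RTT estimate
LARGE = 64 * 1024 * 1024     # bulk payload for the bandwidth estimate

def serve():
    with socket.create_server(("0.0.0.0", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            for _ in range(PINGS):         # phase 1: echo single bytes back
                conn.recv(1)
                conn.sendall(b"a")
            remaining = LARGE              # phase 2: sink the bulk payload,
            while remaining > 0:           # then ack so the client stops timing
                remaining -= len(conn.recv(1 << 20))
            conn.sendall(b"a")

def probe(host):
    with socket.create_connection((host, PORT)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        t0 = time.perf_counter()           # RTT: 1-byte ping, wait for the echo
        for _ in range(PINGS):
            s.sendall(b"a")
            s.recv(1)
        rtt_us = (time.perf_counter() - t0) / PINGS * 1e6
        payload = b"\0" * LARGE            # bandwidth: push bulk data, await ack
        t0 = time.perf_counter()
        s.sendall(payload)
        s.recv(1)
        gbps = LARGE * 8 / ((time.perf_counter() - t0) * 1e9)
        print(f"RTT ~{rtt_us:.1f} us, bulk ~{gbps:.2f} Gbit/s")

if __name__ == "__main__":
    if sys.argv[1] == "serve":
        serve()
    else:
        probe(sys.argv[1])

Run python probe_path.py serve on one node and python probe_path.py <peer_ip> on the other; TCP_NODELAY keeps Nagle's algorithm from skewing the small-message round trips. Running something like this periodically per link and per interface would give exo the raw numbers for the kind of path characterization described above.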

I'm curious what the TB4/TB5 performance looks like between a couple of Mac Mini nodes, or between a Mac Mini and a laptop on AC power vs. on its internal battery. There isn't much data out there on 40 Gb TB4 or "80 Gb" TB5 latency.

@AlexCheema props for publishing exo!

phansel · Jan 04 '25 07:01

This is interesting -- so you're saying we should model the relationship between message size and bandwidth/latency and use that information to change how we split the model?
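For reference, that relationship is often summarized with the simple alpha-beta (Hockney) model, time(n) ≈ alpha + n/beta, where alpha is the per-message latency and beta the asymptotic bandwidth. Fitting it to NetPIPE-style samples yields two numbers per link that a partitioning heuristic could plug into its cost estimates. Here is a hedged sketch with made-up sample values (not taken from the plots in this thread):

# Fit time(n) = alpha + n / beta to measured (message size, transfer time)
# samples for one link. Illustrative sketch; the sample data is invented.
import numpy as np

def fit_alpha_beta(sizes_bytes, times_s):
    # Ordinary least squares on t = alpha + n * (1 / beta).
    A = np.column_stack([np.ones_like(sizes_bytes, dtype=float), sizes_bytes])
    (alpha, inv_beta), *_ = np.linalg.lstsq(A, times_s, rcond=None)
    return alpha, 1.0 / inv_beta           # latency (s), bandwidth (bytes/s)

def transfer_cost(n_bytes, alpha, beta):
    # Predicted time to move n_bytes over this link.
    return alpha + n_bytes / beta

# Hypothetical (bytes, seconds) samples -- substitute real NetPIPE output.
sizes = np.array([64, 1024, 65536, 1048576], dtype=float)
times = np.array([30e-6, 32e-6, 180e-6, 2.4e-3])

alpha, beta = fit_alpha_beta(sizes, times)
print(f"latency ~{alpha * 1e6:.1f} us, bandwidth ~{beta * 8 / 1e9:.2f} Gbit/s")
print(f"predicted cost of a 10 MB transfer: {transfer_cost(10e6, alpha, beta) * 1e3:.2f} ms")

Real links usually show plateaus and protocol switchovers that a single linear fit smooths over, so a piecewise or per-size-range fit may track the NetPIPE curves more faithfully.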

AlexCheema · Jan 06 '25 14:01

Borrowed a couple of M2 Ultra devices and measured their performance over TB4.

For reference, iperf3 gives ~17 Gbit with a single TB4 cable and ~27 Gbit with two in parallel on a Thunderbolt Bridge.

One TB4 cable between the two devices. [Plots: np-M2UltraToM2Ultra-latency, np-M2UltraToM2Ultra-bw]

Two TB4 cables. [Plots: np-M2UltraToM2Ultra-latency, np-M2UltraToM2Ultra-bw]

Something isn't quite right about these plots: one 40 Gbit cable shouldn't be able to carry 350 Gbps, and two in parallel shouldn't be able to carry more than 80 Gbps.

I wonder if the Thunderbolt controller or OS might be compressing the packets once they're bigger than ~1KB or so.

phansel · Jan 07 '25 02:01

My hosts file wasn't correct: the task was likely distributed to two processes on the same machine. The bandwidth at the largest (65,536-byte) message size is still a bit unrealistically high, but it's more reasonable than in the previous plots.

np.hosts needed to be in this format:

host1_ip slots=1
host2_ip slots=1

Single TB4 cable: [plots: np-M2Ultra-oneTB4-bw, np-M2Ultra-oneTB4-latency]

Two TB4 cables: [plots: np-M2Ultra-bw, np-M2Ultra-latency]

I also see a bit higher iperf3 speeds after re-connecting the cables: 35 Gbit/sec with one cable, 50 Gbit/sec with both. The connections have to be fully removed and re-made to hit these numbers.

phansel · Jan 07 '25 22:01

Tried to get a benchmark on the 10G Ethernet built into these two devices. Could not get NetPIPE to progress beyond:

user@dev1 $ mpirun --hostfile np.eth.hosts NPmpi --start 1 --end 65536 -o qdr.eth          
Saving output to qdr.eth
      Clock resolution ~   1.000 usecs      Clock accuracy ~   1.000 usecs
Start testing with 7 trials for each message size
[dev2][[3076,1],1][btl_tcp_frag.c:241:mca_btl_tcp_frag_recv] peer: dev1 mca_btl_tcp_frag_recv: readv failed: Operation timed out (60)
[dev2:00000] *** An error occurred in Socket closed
[dev2:00000] *** reported by process [201588737,1]
[dev2:00000] *** on a NULL communicator
[dev2:00000] *** Unknown error
[dev2:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dev2:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-dev1-1436@1,1]
   Exit code:    14
--------------------------------------------------------------------------
  1:       1  B     24999 times -->  %  

iperf3 works fine and shows 9.41 Gbit/sec in either direction.

Either I'm holding it wrong or something's not behaving correctly under macOS Sequoia.

phansel · Jan 08 '25 22:01