[Feature Request] Network performance benchmarks
How do users notice network degradation between exo nodes in regular operation - whether that's WiFi, Thunderbolt, Ethernet, or another link? How do they characterize links, other than with simpler tools like iperf3?
A typical tool for this is NetPIPE: https://netpipe.cs.ksu.edu/ . It uses MPI (e.g. Open MPI) and SSH to answer the question: for a message size of N bytes, what are the minimum latency and maximum bandwidth between any two given nodes?
Installation
NetPIPE won't compile as-is on macOS (mpicc backed by Apple clang-1600.0.26.4); the makefile must be modified to drop -lrt, since macOS has no separate librt.
makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/makefile b/makefile
index 0cbcbd5..d092575 100755
--- a/makefile
+++ b/makefile
@@ -34,7 +34,7 @@ CC = gcc
ifdef MACOSX_DEPLOYMENT_TARGET
CFLAGS = -g -O3 -Wall
else
- CFLAGS = -g -O3 -Wall -lrt
+ CFLAGS = -g -O3 -Wall
endif
SRC = ./src
On Fedora 41 (mpicc gcc version 14.2.1 20240912), a different patch is required to compile:
diff --git a/src/netpipe.h b/src/netpipe.h
index ed96aa6..5bff8d0 100644
--- a/src/netpipe.h
+++ b/src/netpipe.h
@@ -99,7 +99,7 @@
// return error check variable and macro
-int err;
+//int err;
#define ERRCHECK( _ltrue, _format, _args...) do { \
if( _ltrue ) { \
np.hosts config
On any system, the self-to-self performance is measured with the following np.hosts file:
0.0.0.0 slots=2
For a two-machine system, you'd use this kind of file:
host_0_ip slots=1
host_1_ip slots=1
Build the MPI benchmark and run it against the hosts listed in np.hosts:
make mpi
mpirun -np 2 --hostfile np.hosts NPmpi -o np.mpi.qdr --start 1 --end 65536
This in turn can be plotted with npplot, which depends on gnuplot: https://gitlab.beocat.ksu.edu/PeterGottesman/netpipe-5.x/-/blob/master/npplot?ref_type=heads .
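If gnuplot isn't available, the output file can also be plotted directly. Below is a minimal Python/matplotlib sketch (not part of NetPIPE); it assumes a whitespace-separated output with message size in bytes in the first column and throughput in Mbps in the second, so check the columns of your np.* file, since NetPIPE versions differ.
# plot_netpipe.py: quick-and-dirty alternative to npplot.
# Assumed columns: message size (bytes), throughput (Mbps); adjust the
# indices below if your NetPIPE version writes a different layout.
import sys
import matplotlib.pyplot as plt

sizes, mbps = [], []
with open(sys.argv[1]) as f:
    for line in f:
        cols = line.split()
        if len(cols) >= 2:
            sizes.append(float(cols[0]))
            mbps.append(float(cols[1]))

plt.semilogx(sizes, mbps, marker="o")
plt.xlabel("message size (bytes)")
plt.ylabel("throughput (Mbps)")
plt.grid(True, which="both")
plt.savefig("netpipe.png", dpi=150)
Run it as: python3 plot_netpipe.py np.mpi.qdr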
I created a gist that does everything except installation here: https://gist.github.com/phansel/26677111a61a53c0c3cdbdf94ae1a66e.
A future version of exo could characterize each path in a cluster at runtime and use that to improve resource allocation or report connectivity issues (e.g. degraded cable or connector).
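As a rough illustration of the reporting side (purely hypothetical; none of these names exist in exo today), comparing measured bandwidth against the nominal rate of the medium is already enough to flag a suspect cable or connector:
# Hypothetical sketch, not an exo API: flag links whose measured bandwidth
# falls far below the nominal rate of the underlying medium.
NOMINAL_GBPS = {"thunderbolt4": 40.0, "10gbe": 10.0, "wifi": 1.0}  # rough nominal rates

def degraded_links(measurements, threshold=0.25):
    # measurements: iterable of (link_name, medium, measured_gbps)
    flagged = []
    for name, medium, measured_gbps in measurements:
        nominal = NOMINAL_GBPS.get(medium)
        if nominal is not None and measured_gbps < threshold * nominal:
            flagged.append((name, measured_gbps, nominal))
    return flagged

# A TB4 link measuring only 3 Gbit/s would be reported as likely degraded:
print(degraded_links([("mini1<->mini2", "thunderbolt4", 3.0)]))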
I'm curious what TB4/TB5 performance looks like between a couple of Mac Mini nodes, or between a Mac Mini and a laptop on AC power vs. on its internal battery. There isn't much data out there on 40 Gbit/s TB4 or "80 Gbit/s" TB5 latency.
@AlexCheema props for publishing exo!
This is interesting -- so you're saying we should model the relationship between message size and bandwidth/latency and use that information to change how we split the model?
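For what it's worth, a NetPIPE curve is usually well summarized by the two-parameter model T(N) ≈ latency + N/bandwidth. Fitting those two numbers per link would give a partitioner a cheap way to estimate how long a tensor of a given size takes to cross each hop. A minimal sketch (illustrative only, not exo code), assuming you already have (message size, transfer time) samples:
# Fit T(N) = alpha + N / beta to (size_bytes, time_seconds) samples from a
# NetPIPE run, then estimate the per-hop cost of moving a tensor of a given size.
import numpy as np

def fit_link(sizes_bytes, times_s):
    sizes = np.asarray(sizes_bytes, dtype=float)
    times = np.asarray(times_s, dtype=float)
    # Linear least squares in (alpha, 1/beta): T = alpha + N * (1/beta)
    A = np.stack([np.ones_like(sizes), sizes], axis=1)
    (alpha, inv_beta), *_ = np.linalg.lstsq(A, times, rcond=None)
    return alpha, 1.0 / inv_beta  # latency in seconds, bandwidth in bytes/s

def transfer_time(n_bytes, alpha, beta):
    return alpha + n_bytes / beta

# Made-up example: a link with ~20 us latency and ~2 GB/s bandwidth.
sizes = np.array([64, 1024, 65536, 1 << 20], dtype=float)
times = 20e-6 + sizes / 2e9
alpha, beta = fit_link(sizes, times)
print(transfer_time(8 << 20, alpha, beta))  # est. seconds to move an 8 MiB tensor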
Borrowed a couple of M2 Ultra devices and measured their performance over TB4.
For reference, iperf3 gives ~17 Gbit/s with a single TB4 cable and ~27 Gbit/s with two in parallel on a Thunderbolt Bridge.
One TB4 cable between the two devices: [plot]
Two TB4 cables: [plot]
Something isn't quite right about these plots. One 40 Gbit/s cable shouldn't be able to carry 350 Gbit/s, and two in parallel shouldn't be able to carry more than 80 Gbit/s.
I wonder if the Thunderbolt controller or the OS might be compressing the packets once they're bigger than ~1 KB or so.
My hosts file wasn't correct: the benchmark was likely distributed to two processes on the same machine. The bandwidth at the 64 KiB end of the range is still a bit unrealistically high, but it's more reasonable than in the previous plots.
np.hosts needed to be in this format:
host1_ip slots=1
host2_ip slots=1
Single TB4 cable: [plot]
Two TB4 cables: [plot]
I also see somewhat higher iperf3 speeds after re-connecting the cables: 35 Gbit/s with one cable and 50 Gbit/s with both. The cables have to be fully unplugged and re-connected to hit these numbers.
Tried to get a benchmark on the 10G Ethernet built into these two devices. Could not get NetPIPE to progress beyond the following:
user@dev1 $ mpirun --hostfile np.eth.hosts NPmpi --start 1 --end 65536 -o qdr.eth
Saving output to qdr.eth
Clock resolution ~ 1.000 usecs Clock accuracy ~ 1.000 usecs
Start testing with 7 trials for each message size
[dev2][[3076,1],1][btl_tcp_frag.c:241:mca_btl_tcp_frag_recv] peer: dev1 mca_btl_tcp_frag_recv: readv failed: Operation timed out (60)
[dev2:00000] *** An error occurred in Socket closed
[dev2:00000] *** reported by process [201588737,1]
[dev2:00000] *** on a NULL communicator
[dev2:00000] *** Unknown error
[dev2:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dev2:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:
Process name: [prterun-dev1-1436@1,1]
Exit code: 14
--------------------------------------------------------------------------
1: 1 B 24999 times --> %
iperf3 works fine and shows 9.41 Gbit/s in either direction.
Either I'm holding it wrong or something's not behaving correctly under macOS Sequoia.