The execution time of multi-node between switches is long

Open sunsirui opened this issue 1 year ago • 2 comments

Hello, I tested the following case:

  1. 8 nodes communicating across switches.
  2. The AllPingPong motif, message size = 1048576 B.
  3. Fat-tree topology.

The complete configuration is as follows:

#######################
from email.mime import base
import sst
from sst.merlin.base import *
from sst.merlin.endpoint import *
from sst.merlin.interface import *
from sst.merlin.topology import *
from sst.ember import *

if __name__ == "__main__":

    PlatformDefinition.setCurrentPlatform("firefly-defaults")

    ### Setup the topology
    topo = topoFatTree()

    topo.shape = "4,1:2"
    topo.routing_alg = "deterministic"

    # Set up the routers
    router = hr_router()
    router.link_bw = "25GB/s"
    router.flit_size = "8B"
    router.xbar_bw = "16Tb/s"
    router.input_latency = "130ns"
    router.output_latency = "130ns"
    router.input_buf_size = "64kB"
    router.output_buf_size = "64kB"
    router.xbar_arb = "merlin.xbar_arb_lru"

    topo.router = router
    topo.link_latency = "300ns"

    ### set up the endpoint
    networkif = LinkControl()
    networkif.link_bw = "12.5GB/s"
    networkif.input_buf_size = "64kB"
    networkif.output_buf_size = "64kB"

    ep = EmberMPIJob(0,8,1,1)
    ep.network_interface = networkif

    ### set up the MPI
    ep.addMotif("Init")
    ep.addMotif("AllPingPong iterations=1 messageSize=1048576")
    ep.addMotif("Fini")
    system = System()
    system.setTopology(topo)
    system.allocateNodes(ep,"linear")
    system.build()

#######################

The simulation never prints a simulated-time result. With trace logging enabled, I can see that the program keeps executing, sending, receiving, and copying data between NIC, switch, and NIC, but the process takes a very long time. My question is: aren't threads executed in parallel in SST? Why is it so time-consuming with only eight nodes and 1 MiB messages? By the way, in my tests, eight nodes within a single switch run normally; only large messages between switches get stuck.

Thank you.

sunsirui • Jun 28 '23 07:06

I suspect this is the culprit:

router.xbar_bw = "16Tb/s"

With a flit size of 8B, this would create a switch clock rate of 250 GHz (16 Tb/s divided by 64 bits per flit). If you are running in parallel, then you're spending all your time synchronizing at a period of about 4 ps. I think you are misinterpreting the xbar_bw field: it is a per-port bandwidth, usually set somewhere between 1x and 2x the link bandwidth, typically around 1.5x. Even at that, you may want to increase the flit size to get a reasonable clock frequency. For example, 16B flits would give you a clock rate of a little over 2 GHz if you use a 1.5x multiplier on xbar over link bandwidth.
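The arithmetic behind those numbers can be checked with a short sketch. It assumes, as the comment above implies, that the crossbar clock is simply xbar_bw divided by the flit size; the helper name `xbar_clock_hz` is made up for illustration and is not part of SST.

```python
def xbar_clock_hz(xbar_bw_bits_per_s: float, flit_size_bytes: int) -> float:
    """Flits per second one port must move (assumed clock = xbar_bw / flit size)."""
    return xbar_bw_bits_per_s / (flit_size_bytes * 8)


# Settings from the posted configuration: xbar_bw = 16 Tb/s, flit_size = 8 B.
original = xbar_clock_hz(16e12, 8)

# Rule of thumb from the reply: xbar_bw = 1.5 x link_bw (25 GB/s), flit_size = 16 B.
link_bw_bits_per_s = 25e9 * 8
suggested = xbar_clock_hz(1.5 * link_bw_bits_per_s, 16)

for label, hz in (("original", original), ("suggested", suggested)):
    print(f"{label}: {hz / 1e9:.2f} GHz, period {1e12 / hz:.2f} ps")
```

Under that rule of thumb, the router settings would become something like `router.xbar_bw = "37.5GB/s"` and `router.flit_size = "16B"`. These values are derived from the reply, not tested against the configuration in this issue.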

feldergast • Jul 18 '23 21:07

@sunsirui Just wanted to check if you were still having issues with this. If not, can we close this issue?

feldergast • Aug 03 '23 20:08