sst-elements
Event queue empty when running motifs with large message size
Hi,
I am trying to run the Ember ring motif with a modification. In the original ring motif, node i sends packets to node (i+1) after receiving packets from node (i-1). For my application, I need all nodes to send their packets at the same time, so instead of this (the original ring motif):
if ( 0 == rank() ) {
    enQ_send( evQ, m_sendBuf, m_messageSize, DATA_TYPE, to, TAG,
              GroupWorld );
    enQ_recv( evQ, m_recvBuf, m_messageSize, DATA_TYPE, from, TAG,
              GroupWorld, &m_resp );
} else {
    enQ_recv( evQ, m_recvBuf, m_messageSize, DATA_TYPE, from, TAG,
              GroupWorld, &m_resp );
    enQ_send( evQ, m_sendBuf, m_messageSize, DATA_TYPE, to, TAG,
              GroupWorld );
}
I made a slight modification:
enQ_send( evQ, m_sendBuf, m_messageSize, DATA_TYPE, to, TAG,
          GroupWorld );
enQ_recv( evQ, m_recvBuf, m_messageSize, DATA_TYPE, from, TAG,
          GroupWorld, &m_resp );
The motif works with message sizes up to 8192 B, but the simulation exits early when a larger message size is used, as shown below:
EMBER: using param directory: paramFiles
EMBER: platform: default
EMBER: network: topology=dragonfly shape=4:8:4:33
EMBER: numNodes=1056 numNics=1056
EMBER: network: BW=4GB/s pktSize=32B flitSize=32B
EMBER: Job=0, nidList='0-1055'
EMBER: Motif='Init'
EMBER: Motif='Ring iterations=1 compute=0 messagesize=16384'
EMBER: Motif='Fini'
*** Event queue empty, exiting simulation... ***
Simulation is complete, simulated time: 18.4467 Ms
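A side note on the reported time: 18.4467 Ms is numerically what you get when the largest unsigned 64-bit value is read as a count of picoseconds. The sketch below only checks that arithmetic. That the simulator tracks time in picoseconds with a 64-bit counter is an assumption here, not something the log states. If it holds, the figure would indicate that the run ended because no events were left, rather than being a meaningful runtime.

```python
# Largest value an unsigned 64-bit counter can hold.
MAX_U64 = 2**64 - 1

# Assumption: the time counter is in picoseconds (10^12 per second).
PS_PER_SECOND = 10**12


def max_u64_ps_in_megaseconds() -> float:
    """Convert MAX_U64 picoseconds to megaseconds."""
    seconds = MAX_U64 / PS_PER_SECOND
    return seconds / 1e6


print(f"{max_u64_ps_in_megaseconds():.4f} Ms")
```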
For the parameters, I made the following changes:
networkParams = {
"packetSize" : "32B",
"flitSize" : "32B"
}
I use a much smaller packet size for my simulation (32 B instead of the default 2048 B), so many more packets are injected. Can you please tell me if there is anything I need to be careful about when simulating a huge number of packets?
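To put a rough number on "many more packets", here is a small Python sketch. It assumes each message is split into ceil(messageSize / packetSize) packets with no header overhead, and that each of the 1056 nodes sends one message per ring iteration. The function name is illustrative and not part of Ember.

```python
import math


def packets_per_message(message_bytes: int, packet_bytes: int) -> int:
    """Packets needed for one message, ignoring header overhead (assumption)."""
    return math.ceil(message_bytes / packet_bytes)


NUM_NODES = 1056  # numNodes from the EMBER output

for message_bytes in (8192, 16384):
    for packet_bytes in (2048, 32):
        per_msg = packets_per_message(message_bytes, packet_bytes)
        total = per_msg * NUM_NODES  # one send per node per iteration
        print(f"msg={message_bytes}B pkt={packet_bytes}B: "
              f"{per_msg} packets/message, {total} packets/iteration")
```

Under these assumptions, a 16384 B message at 32 B packets becomes 512 packets, 64 times as many as with the 2048 B default.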
The problem occurs on both Ubuntu 18.04.5 LTS and CentOS 7.5.1804.
I built SST from the distributed SST Core 11.0.0 and SST Elements 11.0.0 tarfiles (2021-May-03 release).
Thank you!
Nothing immediately jumps out in the Ember parameters. The total number of packets shouldn't be a problem. The one issue I do see is that the dragonfly shape is not "typical", so there may be issues in the dragonfly models. The shape builds a 32-node group, which is quite small, but that is probably not the issue. Having 33 groups with 4 links to each group means that you have 4x more global bandwidth than injection bandwidth. While this should theoretically work, the routing algorithms haven't been tested with configurations that resemble that in any way. Try changing the number of intergroup links to 1 and see if you still see the same issue.
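The group size and bandwidth ratio mentioned above can be checked with a short Python sketch. It assumes the shape string is ordered hosts_per_router:routers_per_group:intergroup_links:num_groups, which is inferred from the numbers in this thread, and that every link has the same bandwidth with one injection link per host. The helper name is hypothetical and not part of SST.

```python
def dragonfly_summary(shape: str) -> dict:
    """Derive group size and global/injection link ratio from a shape string.

    Assumed field order (not confirmed by the thread):
    hosts_per_router:routers_per_group:intergroup_links:num_groups
    """
    hosts, routers, links, groups = (int(x) for x in shape.split(":"))
    nodes_per_group = hosts * routers              # injection links per group
    global_links_per_group = links * (groups - 1)  # links to all other groups
    return {
        "nodes_per_group": nodes_per_group,
        "total_nodes": nodes_per_group * groups,
        "global_links_per_group": global_links_per_group,
        "global_to_injection_ratio": global_links_per_group / nodes_per_group,
    }


# The shape from the question, and the suggested variant with 1 intergroup link.
for shape in ("4:8:4:33", "4:8:1:33"):
    print(shape, dragonfly_summary(shape))
```

With these assumptions, 4:8:4:33 gives 32 nodes per group, 1056 nodes in total, and a global-to-injection ratio of 4, while 4:8:1:33 keeps the same node count and brings the ratio down to 1.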