sst-elements Documentation for SST-elements Ember, Firefly and Merlin

We are currently using SST to evaluate the performance of some existing (and a couple of notional) exascale network topologies. While building new topologies and adding new routing algorithms, we have run into difficulties due to the lack of documentation. Most of the resources we found online are slides, presumably from a tutorial, which makes it hard to get enough detail. It would be great if you could provide detailed documentation on the Ember, Firefly and Merlin elements, describing the general workflow of each component and explaining some of the key APIs/functions in each. That would help us understand the network simulator in more detail.

Specifically, we would like to know more about the following:

Merlin:
a. What are the general steps needed to implement a new topology? Based on our understanding, we need to create a Python class (which must have a build method) for the topology and declare it as a subcomponent. We also need to implement the routing logic (in a .cc file). We would like to know which basic functions are needed in the routing class (like route_packet, route_init, etc.) and whether they are called from components outside Merlin (like Firefly or the simulator engine itself).
b. A basic FSM/workflow diagram of how the Ember, Firefly and Merlin components interact during the simulation would be really useful.
c. Also, what is the purpose of a virtual network (VN)?

Firefly:
a. Basic documentation on what Firefly does to the event queue received from Ember before sending it to the Merlin router. What are some of the key functions and their purpose?

Ember:
a. Basic documentation on how the event queues are generated from the available benchmarks.
b. Can you give more details on the emberotf2.cc motif?

Jan 17 '22 21:01 saichenna

Here are some answers to your merlin questions:

Implementing a new topology requires two parts: a C++ SubComponent derived from the Topology class defined in sst/elements/merlin/router.h, and a Python class derived from the Topology class in sst/elements/merlin/pymerlin-base.py. The purpose of the C++ class is to provide all the functionality needed to assign routes to packets in the router. The Python class is responsible for creating the connection graph when the simulation is built.

The C++ topology class is called only by hr_router to help it know how to move packets through the network. This is a common feature of all SST SubComponents: they are only called by the object that loads them. I'll try to add some additional comments to the Topology class in router.h to better explain when the functions are called and what they are intended to do. For now, looking at the hyperx topology class in sst/elements/merlin/topology will give you a good example of what each of the functions does.

The Python topology class should not inherit from the base class in pymerlin.py. That was the initial prototype for the merlin python model and will be removed in SST 12 in favor of the new module based off of classes in pymerlin-base.py.

The merlin python module based off of classes in pymerlin-base.py is relatively new and has been in flux. It has mostly stabilized, but documentation is still a work in progress. Looking at the hyperx topology in sst/elements/merlin/topology/pymerlin-topo-hyperx.py will give you a good example of how the class works. It plugs into a broader set of classes, but only interacts with them through the APIs defined in the python Topology class.

There is a tricky piece that needs a brief explanation. The Python Topology class inherits from a base class that overloads many of the low level Python functions (getattr and setattr, for example). As such, you have to declare all variables that will be used in the class. If you try to use an undeclared variable, you will get an error. There are two types of these declarations: Variables and Params. The major difference is that Params are intended to be passed as parameters to underlying SST elements during build. There is a set of functions that allows you to group the parameters into sets that can be passed to the correct object.

In the hyperx class, there are two calls to these functions. The call to _declareClassVariables() creates variables that are used by the class, but not passed to any elements during build(). The call to _declareParams("main", [...]) declares a set of variables that will end up in a dictionary called "main" that can later be fetched and passed in a call to addParams() for an element instance in the build() function. All of the variables declared with either of these functions can be accessed like a normal class data member (i.e. comp.link_latency, comp.algorithm, etc.). Please let me know if you have more specific questions about this. Meanwhile, I'll see if I can move the documentation work higher on my priority list.
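To make the declare-before-use behavior concrete, here is a minimal, self-contained sketch of the idea. It is not the pymerlin implementation: the class and method names (DeclaredAttrs, declare_variables, declare_params, get_group_params, ToyTopology) are hypothetical stand-ins chosen for illustration, and only the general mechanism (overloaded attribute assignment plus named parameter groups) follows the description above.

```python
class DeclaredAttrs:
    """Toy base class: attributes must be declared before they are assigned."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the bookkeeping fields.
        object.__setattr__(self, "_declared", set())
        object.__setattr__(self, "_groups", {})

    def declare_variables(self, names):
        # Hypothetical analogue of class variables: used by the class only.
        self._declared.update(names)

    def declare_params(self, group, names):
        # Hypothetical analogue of Params: also recorded under a named group.
        self._declared.update(names)
        self._groups.setdefault(group, []).extend(names)

    def __setattr__(self, name, value):
        # Reject any attribute that was never declared.
        if name not in self._declared:
            raise AttributeError(f"undeclared variable: {name}")
        object.__setattr__(self, name, value)

    def get_group_params(self, group):
        # Collect the params of one group that have actually been assigned.
        return {n: getattr(self, n)
                for n in self._groups[group] if n in self.__dict__}


class ToyTopology(DeclaredAttrs):
    def __init__(self):
        super().__init__()
        self.declare_variables(["link_latency"])
        self.declare_params("main", ["shape", "algorithm"])


topo = ToyTopology()
topo.link_latency = "1ns"      # class variable, not part of any group
topo.algorithm = "minimal"     # param in the "main" group
main_params = topo.get_group_params("main")
```

In this toy version, main_params holds only the assigned "main" params, which is the kind of dictionary one would hand to an element in build(); assigning to a name that was never declared raises AttributeError.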

The virtual network is a common abstraction used in networks. A virtual network represents a set of independent resources through the network, made up of one or more virtual channels (sometimes called virtual lanes). A given packet will stay in the virtual network on which it was injected into the router, but can change virtual channels based on the requirements of the routing algorithm to avoid deadlock in the network. When the Topology object is created in hr_router, it will be given the number of virtual networks required.

Jan 18 '22 17:01 feldergast

Hello @feldergast , thanks for the help. We were able to implement our topology into SST. We’re currently trying to make sense of the simulation results. I would like to confirm my understanding on what is happening during the simulation and also get some clarification on some questions we had.

As an example case study, I'm using a topoSingle() class to instantiate a router with 6 ports, which is connected to network endpoints driven by Ember (configuration file attached). I have specified the router link bandwidth as 1GB/s and the router link latency as 1ns. I was initially interested only in the time spent moving packets across links, so I set the router input and output latency to 0ns. I ran an Ember PingPong benchmark with the default size of 1024 for 1 iteration, where rank2 (the destination) is set to 4 (the source would always be 0, I presume). I enabled packet tracing for node 0 (assume 1 rank per node), which tracks all the packets sent by rank 0 (tracePkt is set to -2). I'm also attaching the output of the log here.

Questions regarding the communication time:
a. Looking at the time stamps, the first packet is at the Send method on the LinkControl of the first NIC at 391ns. It reached the router port at 400ns. So, I'm assuming that the hop/transmission time is 9ns. Is the transmission time calculated through a latency and bandwidth model, i.e., time per hop = link_latency + packet_size/link_bandwidth? If yes, considering 1ns for link latency, how did packet_size/link_bandwidth result in 8ns?
b. Is the time advanced at the router calculated as router_input_latency + packet_size/router_bandwidth + router_output_latency? If not, can you specify how to calculate the time advanced at the router between receiving a packet at a specific port and starting to forward it to the output port based on its policy?
c. Can you specify how the message of size 1024 is divided into packets? Based on the log, I see only one packet (with ID 0) being traced (packets with ID 1 and 2 are messages sent by rank 0 during the finalize stage for synchronization, I guess, as they are sent to ranks 1 and 2). Does the NIC or LinkControl module divide the message into packets and send them to the router? If so, what is the size of each packet? Is it equal to the output_buff_size of the LinkControl() module?
d. Finally, how does the router flit_size affect the communication?

I'm sorry for asking so many questions. But this would greatly help our team both in terms of understanding the simulator and at the same time gaining confidence in the simulation results. toposingle_6nodes.txt sstoutputlogemberpingpong .

Jan 22 '22 01:01 saichennaintel

Happy to help out. Here are the answers to the questions:

a) The 391 ns is the time it takes for the init motif to complete (though I don't think you were asking about that specifically). The 9 ns comes from the time it takes the LinkControl to recognize the incoming packet. If there are no packets in the LinkControl, then the LinkControl will start sending the incoming packet one "flit cycle" after receiving it. In your case the flit size is set to 8B and the bandwidth is 1GB/s. This means each flit takes 8ns to serialize, which constitutes the flit cycle. So 8ns (flit cycle) + 1ns (link latency) = 9ns. Subsequent flits will pipeline, so you won't see the extra 8ns delay for every packet. Also, the packet shows as arrived when the head flit arrives. The serialization latency is handled by each model, but there is no event sent for the end of the message.
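As a quick check of the arithmetic in (a), the sketch below recomputes the flit cycle and the first-hop delay from the 8B flit size, the 1GB/s link bandwidth and the 1ns link latency of the configuration discussed in this thread. The variable names are illustrative only and are not SST parameters.

```python
# Values taken from the configuration discussed in the thread.
flit_size_bytes = 8
link_bw_bytes_per_s = 1_000_000_000   # 1 GB/s
link_latency_ns = 1
send_time_ns = 391                    # time of the Send call in the trace

# One flit cycle is the time to serialize a single flit on the link.
flit_cycle_ns = flit_size_bytes * 1_000_000_000 // link_bw_bytes_per_s

# The head flit leaves one flit cycle after the LinkControl receives the
# packet, then spends the link latency in flight.
first_hop_ns = flit_cycle_ns + link_latency_ns
arrival_time_ns = send_time_ns + first_hop_ns

print(flit_cycle_ns, first_hop_ns, arrival_time_ns)
```

This reproduces the 8ns flit cycle, the 9ns hop, and the 400ns arrival time seen in the trace.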

b) For timing in the router: A packet will be available to the crossbar arbitration logic after the input_latency time. It will sit at the head of the queue until it succeeds in arbitrating for a port in the crossbar, at which point it will be transferred to the output queue. For simulation efficiency purposes, the output_latency is just added to the link latency, so the packet will look like it was sent earlier than anticipated, but it will be delayed the proper amount before reaching the next router. Again, only the head flit is tracked timing-wise, but all the resources are marked busy based on the serialization time of the packet. Also, a packet will not advance until there is room for the entire packet in the next buffer.

c) For the default ember stack, the MTU is set to 2048B, so a 1024B message will be in a single packet. Internally, everything is tracked on flits, and the number of flits is rounded up if the packet does not divide evenly. The messages are split into packets by the ember/firefly stack.
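The packet and flit counts in (c) can be sketched as follows. This is only an illustration of the rounding described above, assuming payload bytes only (no header overhead) and assuming that a message larger than the MTU is cut into MTU-sized packets with one smaller trailing packet; the helper names are hypothetical and the exact splitting done by the ember/firefly stack may differ.

```python
def ceil_div(a, b):
    # Integer division rounded up.
    return -(-a // b)


def split_message(message_bytes, mtu_bytes):
    # Assumed splitting: full MTU-sized packets, then one remainder packet.
    full, rest = divmod(message_bytes, mtu_bytes)
    return [mtu_bytes] * full + ([rest] if rest else [])


def flits_per_packet(packet_bytes, flit_bytes):
    # The flit count is rounded up when the packet does not divide evenly.
    return ceil_div(packet_bytes, flit_bytes)


mtu = 2048    # default MTU of the ember stack, per the answer above
flit = 8      # flit size from the configuration in this thread

pingpong_packets = split_message(1024, mtu)
pingpong_flits = [flits_per_packet(p, flit) for p in pingpong_packets]
odd_flits = flits_per_packet(1030, flit)

print(pingpong_packets, pingpong_flits, odd_flits)
```

For the 1024B PingPong message this gives a single packet of 128 flits, while a hypothetical 1030B packet would occupy 129 flits because of the rounding.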

d) In addition to the information about flits above, xbar_bw / flit_size also determines the frequency of the router's core clock, which is what the arbitration unit runs at. In general, you should balance your xbar_bw and flit_size to generate a reasonable clock frequency. I tend to keep the frequency at 2 GHz or below.
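The relation in (d) can be checked with a few lines. The xbar_bw values below are hypothetical examples, not taken from the attached configuration, and the function name is illustrative rather than an SST API.

```python
def core_clock_hz(xbar_bw_bytes_per_s, flit_size_bytes):
    # Router core clock frequency implied by xbar_bw / flit_size.
    return xbar_bw_bytes_per_s / flit_size_bytes


# Hypothetical example: a 16 GB/s crossbar with 8 B flits.
balanced_hz = core_clock_hz(16_000_000_000, 8)

# Hypothetical example: the same crossbar with 2 B flits runs much faster.
fast_hz = core_clock_hz(16_000_000_000, 2)

print(balanced_hz / 1e9, fast_hz / 1e9)  # frequencies in GHz
```

The first case lands exactly on the 2 GHz guideline mentioned above; the second gives 8 GHz, which suggests increasing flit_size or lowering xbar_bw.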

Jan 24 '22 16:01 feldergast

Thank you @feldergast for the details. They are certainly helping us better understand the simulation results. I have one other question regarding the hr_router. (I'm posting it here, but let me know if it would be better to submit a new ticket for this.) I would like to specify latencies and bandwidths for the individual ports of a router. Would that be possible? I have seen an enhancement ticket that has already been raised (https://github.com/sstsimulator/sst-elements/issues/102), which suggests that this feature should be available from SST 7.0. Can you explain how to instantiate hr_router with specific latency and bandwidth values for certain ports?

Jan 28 '22 20:01 saichennaintel
