
avoid rounding up message sizes in ROSS

Open JohnPJenkins opened this issue 9 years ago • 7 comments

ROSS automatically rounds up the size of all messages to either g_tw_event_msg_sz in the normal case, or 500 bytes if ROSS_MEMORY is set. We could improve performance for models using multiple LP types by transmitting only each event's actual size and handling short receives on the receiving side.
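For concreteness, here is a minimal sender-side sketch of what is being proposed. This is illustrative C only, not the actual ROSS networking code; g_tw_event_msg_sz is the global mentioned above (type assumed), and send_event_payload is a made-up helper name.

```c
/* Illustrative sketch, not ROSS source: ship only the bytes an event
 * actually uses instead of the rounded-up maximum. */
#include <mpi.h>
#include <stddef.h>

extern size_t g_tw_event_msg_sz;   /* global maximum named above (type assumed) */

/* hypothetical helper: send one event's message to its destination rank */
static void send_event_payload(const void *msg, size_t actual_sz,
                               int dest_rank, int tag, MPI_Comm comm)
{
    /* today the count is effectively g_tw_event_msg_sz (or 500 bytes with
     * ROSS_MEMORY); the proposal is to pass actual_sz instead and let the
     * receiver deal with the resulting short receive */
    MPI_Send(msg, (int)actual_sz, MPI_BYTE, dest_rank, tag, comm);
}
```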

JohnPJenkins avatar Oct 29 '14 20:10 JohnPJenkins

Looking at this in more detail, this optimization would actually break a significant fraction of CODES. In particular, some models funnel messages between intermediary LPs, relying on the oversubscribed event memory size to stuff the recipient's funneled message into. This is used both in the codes-base local storage model and pervasively in modelnet.

The proper way to do the kind of variable-length messaging we need is to use the tw_memory features, but they are off by default in ROSS and we aren't sure how stable they are.

JohnPJenkins avatar Oct 29 '14 20:10 JohnPJenkins

ROSS pre-allocates all event memory. Thus, every event must be able to hold the maximum event size. We currently see this as an optimization... and probably won't change it any time soon.

However, we definitely need to look into ROSS_MEMORY both to ensure that it is functioning and to document its functionality.

gonsie avatar Oct 30 '14 15:10 gonsie

I think there might have been a bit of a misunderstanding here... what the ticket was discussing was the sending of events between MPI ranks. Some events (particularly for our modelnet code in CODES) are much larger than others. Event sending currently does not discriminate between event types and uses the upper bound for message sizes. A fix would involve sizing the send buffer based on the type of event, which would improve performance since less message data would be shipped around. No changes to the upper bounding of event sizes or to the allocation scheme are required here.

JohnPJenkins avatar Oct 30 '14 16:10 JohnPJenkins

Would having the ability to request a large event solve this issue?

Another project has requested an additional API function: tw_event_new_large(..., size_t size);. This function would allocate additional space for your event, which ROSS would be responsible for managing. Hopefully this function would be used relatively infrequently and most events would be a standard size.
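If it helps the discussion, here is a guess at what using that might look like from a model's perspective. Only the trailing size_t parameter comes from the request above; the dest/offset/sender arguments simply mirror tw_event_new, and the payload struct is made up.

```c
/* Hypothetical usage sketch of the proposed tw_event_new_large(). */
#include <string.h>
#include "ross.h"

struct big_payload {
    char data[4096];               /* illustrative oversized payload */
};

void send_big_event(tw_lpid dest, tw_stime offset, tw_lp *sender,
                    const struct big_payload *p)
{
    /* proposed call: ROSS allocates and manages the extra space */
    tw_event *e = tw_event_new_large(dest, offset, sender, sizeof(*p));

    memcpy(tw_event_data(e), p, sizeof(*p));   /* fill the event message */
    tw_event_send(e);
}
```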

gonsie avatar May 16 '16 22:05 gonsie

My 2c:

I think there might be a use for something like tw_event_new_large(), but an easier problem to tackle is this: even if the pre-allocated event buffers have to be a fixed size (a given PE has no idea what events will arrive, so it has to be prepared for the largest possible), that doesn't necessarily mean we always have to transmit messages of that size when sending a remote event with MPI.

MPI matches short recvs just fine. So even if a pool of events is preallocated to hold, say, 600 bytes, and a given LP's event struct is only 100 bytes, it should be able to send just that and have it land in the first 100 bytes of one of those available buffers. This would reduce message sizes, not memory utilization.
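To illustrate with plain MPI (outside of ROSS), using the 600/100 numbers above:

```c
/* Standalone MPI illustration of a short receive: the posted buffer is
 * 600 bytes, the sender ships only 100, and MPI_Get_count reports the
 * actual size on the receiving side.  Run with at least 2 ranks. */
#include <mpi.h>
#include <string.h>

#define POOL_EVENT_SZ   600    /* pre-allocated event buffer size */
#define ACTUAL_EVENT_SZ 100    /* what this LP's event struct really needs */

int main(int argc, char **argv)
{
    int rank;
    char buf[POOL_EVENT_SZ];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        memset(buf, 0x42, ACTUAL_EVENT_SZ);
        MPI_Send(buf, ACTUAL_EVENT_SZ, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        int received;

        /* post for the largest possible event... */
        MPI_Recv(buf, POOL_EVENT_SZ, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
        /* ...but only the bytes actually sent land in the buffer */
        MPI_Get_count(&status, MPI_BYTE, &received);   /* received == 100 */
    }

    MPI_Finalize();
    return 0;
}
```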

I think this would be a performance win for heterogeneous models. The codes-rebuild model, for example, has 4 different LP types, each with a different event size. Even if we round up the event buffers to be the largest possible across all LPs (presently 800+ bytes in that model), we could probably avoid literally putting 800+ bytes on the wire for every remote message sent by the simulation.

Even modelnet (which, as John pointed out, encapsulates other LP event types into its own payload) isn't always encapsulating the same size payload; it takes a run-time parameter on each msg event telling it how much data needs to be wrapped within the modelnet event. So it likewise needs to be able to support a pre-defined maximum amount of pre-allocated event memory in any given msg (just in case), but it doesn't necessarily need to transmit that much through MPI every time.
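A sketch of that idea (names and layout here are hypothetical, not the actual modelnet structures):

```c
/* Illustrative only: compute the bytes actually shipped for a
 * modelnet-style event from the wrapped payload size, rather than
 * always using the pre-allocated maximum. */
#include <assert.h>
#include <stddef.h>

extern size_t g_tw_event_msg_sz;        /* pre-allocated maximum (type assumed) */

struct modelnet_wire_msg {
    /* fixed modelnet header fields would live here */
    size_t wrapped_bytes;               /* run-time size of the wrapped user event */
    /* wrapped user event follows, up to the pre-allocated limit */
};

static size_t wire_size(const struct modelnet_wire_msg *m)
{
    size_t sz = sizeof(*m) + m->wrapped_bytes;
    assert(sz <= g_tw_event_msg_sz);    /* the buffer must still hold the max... */
    return sz;                          /* ...but only the used bytes go on the wire */
}
```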

carns avatar May 27 '16 15:05 carns

On the modelnet connection: we essentially increase the global message size by 2x to allow piggybacking user events onto modelnet events. In reality, for the performance-sensitive transports (dragonfly, torus, slimfly, fattree), the messages that carry that payload are a very small fraction of the total events used: only the last flit of a transfer, so an 8-byte flit and a 4K message payload would mean that 511/512 events don't use the extra space (more like 1023/1024 once you consider credit flow control). Short recvs could help a great deal in this case, especially since dragonfly LPs tend not to have a lot of locality to exploit at scale.
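Back-of-the-envelope check of those fractions (numbers taken from the comment above):

```c
/* Quick arithmetic behind the 511/512 and 1023/1024 figures. */
#include <stdio.h>

int main(void)
{
    const int flit_bytes    = 8;
    const int payload_bytes = 4096;
    const int flits = payload_bytes / flit_bytes;   /* 512 flit events per message */

    /* only the last flit event carries the encapsulated payload */
    printf("%d/%d flit events need no extra space\n", flits - 1, flits);

    /* roughly one credit event per flit doubles the event count */
    printf("~%d/%d once credit flow control is counted\n",
           2 * flits - 1, 2 * flits);
    return 0;
}
```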

JohnPJenkins avatar May 27 '16 15:05 JohnPJenkins

As it pertains to the API, Phil's suggestion would require telling ROSS the effective event size on every tw_event_new. In that case, tw_event_new_large would become tw_event_new_sized, and the size passed in could be arbitrary: if <= g_tw_event_size, proceed as normal and handle short recvs; if >, do a multi-round-trip transfer (or use tagging to bin posted recvs by size, i.e. 4K, 8K, ...).
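Roughly, in code (hypothetical names throughout; the two transfer paths are just placeholders):

```c
/* Sketch of the size-based dispatch described above; nothing here is
 * existing ROSS API beyond the g_tw_event_size global named in the
 * comment (type assumed). */
#include <stddef.h>

extern size_t g_tw_event_size;     /* configured maximum event size */

typedef struct tw_event tw_event;  /* opaque for the purposes of this sketch */

/* assumed helpers for the two transfer strategies */
tw_event *event_new_short_recv(size_t size);       /* size <= g_tw_event_size */
tw_event *event_new_multi_round_trip(size_t size); /* size >  g_tw_event_size */

tw_event *tw_event_new_sized(size_t size)
{
    if (size <= g_tw_event_size) {
        /* fits in a normal pre-allocated buffer; the receiver just
         * handles a short recv */
        return event_new_short_recv(size);
    }
    /* oversized: multiple round trips, or bin posted recvs by size
     * class (4K, 8K, ...) using the MPI tag */
    return event_new_multi_round_trip(size);
}
```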

JohnPJenkins avatar May 27 '16 16:05 JohnPJenkins