ROMIO: utilization of multiple threads for collective IO aggregation algorithms
With the work going on for enhanced CH4 multi-threading it has become apparent to me, at least in the case of the ROMIO one-sided collective aggregation algorithm, that multi-threading can have a significant impact on performance for data patterns that are highly discontiguous (procs sending huge numbers of discontiguous data chunks to several aggregators during each round of collective IO -- the climate codes come to mind).

The algorithm could fairly easily be chunked with, say, OpenMP: each thread would concurrently compute what data needs to go where and when, create the derived types, and then issue the ensuing MPI_Put or MPI_Get to the target aggregators. With the implementation of OFI scalable endpoints there would be a separate tx context for each thread, so the MPI communication from this single endpoint would also be completely parallelized across the threads, even to the same aggregator (very large derived MPI datatypes sent to the same aggregator could be broken up across the threads). Especially in the case of multiple cores per rank, the node injection bandwidth would certainly be driven way up and, assuming the fabric can accommodate it, that should result in improved collective read or write performance. On BGQ in particular, where the injection bandwidth is distributed among the 10 links, we should be able to measure significant performance improvements, assuming different target aggregators sit on torus coordinates that use different links for messages sent from the node and assuming the concurrent communication does not create larger torus bottlenecks. I think this could then be a good showcase of the benefits of the new CH4 threading model and its utilization of scalable endpoints.

In terms of application usage this would of course require MPI_THREAD_MULTIPLE. I doubt most applications would benefit from calling collective IO from more than one thread because of the way the file IO is optimized within a given invocation -- an MPI_File_write_at_all executing in thread 0 has no coordination with the MPI_File_write_at_all running in thread 1 to optimize the interaction with the file system. There would also be a huge memory-footprint hit from duplicating the collective IO buffers, as well as duplicated overhead such as MPI window management in the case of the one-sided algorithm. So I think the only way to effectively utilize multiple threads in this situation is within the collective IO algorithm itself, and in general I would assume there would be threads available for, say, the MPI_File_write_at_all called from thread 0 to use for the computation and the MPI_Put/Get.

ROMIO is a strange beast because it is coded as an application using the external MPI_* routines. There is currently an option to start a user pthread for doing the actual POSIX IO, which overlaps the IO with the aggregation for GPFS. I am wondering what acceptable mechanism there would be for MPI-IO to utilize more threads for the collective IO aggregation algorithm itself under MPI_THREAD_MULTIPLE. Especially looking for input from Rob Latham @roblatham00 on this.
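To make the idea concrete, here is a minimal sketch of what one round of the threaded put phase could look like, assuming MPI_THREAD_MULTIPLE and assuming the chunk list for the round has already been computed. The names (`flatten_round_threaded`, `chunk_off`, `chunk_len`, `chunk_agg`, `agg_disp`, `agg_win`) are placeholders rather than actual ROMIO symbols, and the real algorithm would coalesce the chunks destined for each aggregator into derived datatypes rather than issuing one put per chunk:

```c
#include <mpi.h>
#include <omp.h>

/* Hypothetical sketch of one round of the threaded put phase; names are
 * placeholders, not ROMIO symbols.  Requires MPI_THREAD_MULTIPLE. */
void flatten_round_threaded(const char *user_buf,
                            int n_chunks,
                            const MPI_Aint *chunk_off,  /* offset of each chunk in user_buf */
                            const int *chunk_len,       /* length of each chunk in bytes    */
                            const int *chunk_agg,       /* target aggregator rank per chunk */
                            const MPI_Aint *agg_disp,   /* displacement in that aggregator's
                                                           collective-buffer window         */
                            MPI_Win agg_win)
{
    /* One passive-target epoch covers the whole round. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, agg_win);

#pragma omp parallel
    {
        int nth = omp_get_num_threads();
        int tid = omp_get_thread_num();

        /* Static partition of the chunk list across the threads; with OFI
         * scalable endpoints each thread would inject over its own tx context. */
        for (int c = tid; c < n_chunks; c += nth)
            MPI_Put(user_buf + chunk_off[c], chunk_len[c], MPI_BYTE,
                    chunk_agg[c], agg_disp[c], chunk_len[c], MPI_BYTE,
                    agg_win);
    }

    /* Completes all puts from all threads before moving to the next round. */
    MPI_Win_unlock_all(agg_win);
}
```

With a per-thread tx context from scalable endpoints, each thread's puts would be injected independently -- even to the same aggregator -- which is where the injection-bandwidth win described above would come from.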
Multi-threading would be of most benefit for the read, I think, as the aggregators there are very sensitive to computation and network-injection bandwidth, with the data going from just a few aggregators to potentially the entire partition. I think this can be a major bottleneck for current serial read performance for certain IO patterns.
@pkcoff in your one-sided aggregation implementation, what is the mapping of windows to aggregators? Is it one big window to which each aggregator contributes a portion? Or is there one window per aggregator?
The reason I ask is that currently we are looking at allocating VNIs per communicator. Since MPICH duplicates the communicator used to create a window, each window by default will get its own tx/rx context (at least until we run out and wrap around).
So for a window-per-aggregator scheme, there will be some built-in parallelism. Processes that send data to multiple aggregators will utilize multiple VNIs. However, aggregators will still be limited to receiving data over a single VNI.
On a rank designated as an aggregator there is one window over the collective buffer --- there is at most one aggregator per rank, so to speak. Depending on options there could also be a second window over a byte counter for the amount of data moved on the aggregator. We only really need one VNI per window for the management, since I think the idea is that just thread 0 / channel 0 would be doing the window lock/unlock for the collective buffer (and possibly the byte counter window), which is what actually requires the rx. The put or get for the actual data movement should be handled by the hardware on the next-gen system; however, on BGQ only the put is hardware accelerated, so the get would still utilize multiple rx contexts on the aggregator ranks. Maybe best to discuss on the Wednesday telecon -- hopefully Rob could join.
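For reference, a rough sketch of the window layout described here (one window over the collective buffer on each aggregator rank, plus an optional byte-counter window), under the assumption that both windows are created over the same file-domain communicator; the function and variable names are made up for illustration, not ROMIO symbols:

```c
#include <mpi.h>
#include <stdlib.h>

/* Rough sketch of the window setup on each rank for the scheme described
 * above; names are hypothetical. */
void create_agg_windows(MPI_Comm fd_comm, int is_aggregator,
                        size_t coll_bufsize, int use_byte_counter,
                        char **coll_buf_out, MPI_Aint *byte_counter,
                        MPI_Win *coll_win, MPI_Win *count_win)
{
    char *coll_buf = NULL;
    MPI_Aint coll_size = 0;
    MPI_Aint count_size = 0;

    if (is_aggregator) {
        /* Only aggregator ranks expose memory; other ranks join with size 0. */
        coll_buf = (char *) malloc(coll_bufsize);
        coll_size = (MPI_Aint) coll_bufsize;
        count_size = (MPI_Aint) sizeof(MPI_Aint);
        *byte_counter = 0;
    }

    /* Window over the collective buffer on each aggregator. */
    MPI_Win_create(coll_buf, coll_size, 1, MPI_INFO_NULL, fd_comm, coll_win);

    /* Optional second window over a byte counter tracking how much data has
     * been moved to this aggregator (e.g. targeted by MPI_Fetch_and_op).
     * The counter storage must outlive the window. */
    if (use_byte_counter)
        MPI_Win_create(is_aggregator ? (void *) byte_counter : NULL,
                       count_size, (int) sizeof(MPI_Aint),
                       MPI_INFO_NULL, fd_comm, count_win);

    *coll_buf_out = coll_buf;
}
```

Under the VNI-per-communicator scheme mentioned above, the communicator duplication MPICH does for each window creation would then give the collective-buffer window and the byte-counter window their own tx/rx contexts by default.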