MPI benchmark driver
This PR modifies the current driver (main.cpp) and adds MPI support for launching the benchmark across multiple devices. The main takeaways:
- Each MPI rank is assigned a specific GPU and launches the benchmark
- There is no direct GPU-to-GPU communication happening
- For the dot-kernel, the resulting sums are reduced across all MPI ranks (on the host) and broadcast back to each rank via MPI_Allreduce (sketched below).
- Benchmark error checking is performed on all ranks.
- Measured bandwidths are aggregated across all ranks.
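For reviewers, here is a rough sketch of the overall pattern, not the actual code in this PR: each rank derives a node-local rank to pick its device, the per-rank dot sums are combined with MPI_Allreduce so every rank can run the same error check, and per-rank bandwidths are summed onto rank 0 for reporting. All variable names and the device-selection step are placeholders.

```cpp
// Illustrative sketch only -- names like local_sum/local_bw are placeholders.
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Derive a node-local rank so each rank can select its own device,
  // e.g. cudaSetDevice(local_rank) in the CUDA backend.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                      MPI_INFO_NULL, &node_comm);
  int local_rank;
  MPI_Comm_rank(node_comm, &local_rank);

  // ... each rank runs the benchmark kernels on its own device ...

  // Dot kernel: reduce the per-rank partial sums on the host and hand the
  // result back to every rank, so all ranks can run the same error check.
  double local_sum = 0.0;   // placeholder for this rank's partial dot product
  double global_sum = 0.0;
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);

  // Aggregate the measured bandwidths across ranks for reporting.
  double local_bw = 0.0;    // placeholder for this rank's measured GB/s
  double total_bw = 0.0;
  MPI_Reduce(&local_bw, &total_bw, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("Aggregate bandwidth over %d ranks: %.2f GB/s\n", size, total_bw);

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```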
The only major question I have is how MPI should be treated by CMake. I am open to suggestions and happy to comply with whatever you all prefer.
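One option would be to gate MPI behind an opt-in flag; just a sketch, assuming CMake's built-in FindMPI module is acceptable (the USE_MPI option and the babelstream target name are placeholders):

```cmake
# Sketch only: make MPI opt-in rather than a hard requirement.
option(USE_MPI "Enable MPI support in the benchmark driver" OFF)

if (USE_MPI)
  find_package(MPI REQUIRED COMPONENTS CXX)
  # "babelstream" stands in for whatever the driver target is called.
  target_link_libraries(babelstream PRIVATE MPI::MPI_CXX)
  target_compile_definitions(babelstream PRIVATE USE_MPI)
endif ()
```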
We've got a large general refactor of the main driver coming in #186.
We should also think some more about what bandwidth an MPI+X version should be measuring, given there is no communication apart from the dot product. I think we discussed it, but it would be good to document the reasons for wanting MPI+X versions of BabelStream versus running this benchmark on multiple nodes concurrently with pdsh, srun, etc. and post-processing the results.