MLLinOp::makeSubCommunicator() does not scale well to O(10k) ranks
While running the Castro flame_wave problem on 2048 Summit nodes (6 ranks per node, 12,288 ranks total), we found that the average time per call to MLLinOp::makeSubCommunicator() was 0.06 seconds. For comparison, that is roughly twice the cost of a hydro advance on GPUs at that scale.
Note: #998 partially addressed this, since I believe we observed a case in Castro where building the subcommunicator was unnecessary. However, I am not sure whether it resolves the original issue we observed on Summit.
Maybe we should prebuild a number of subcommunicators with various numbers of processes. On the coarsened MG levels, we don't need a subcommunicator that is exactly the size of the DistributionMapping; one that is at least as large would do. A rough sketch of the idea is below.
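Here is a minimal sketch of what such a pool could look like. It is not AMReX code: the class name SubCommPool is hypothetical, and it assumes the ranks participating at a coarse level are a contiguous block starting at rank 0, so that a prebuilt power-of-two-sized communicator can be reused instead of splitting the parent communicator on every level.

```cpp
// Hypothetical sketch: prebuild subcommunicators of power-of-two sizes once,
// then pick the smallest one that can hold a given level's ranks, avoiding a
// collective split/create per MG level.

#include <mpi.h>
#include <vector>

class SubCommPool
{
public:
    // Build communicators containing the first 1, 2, 4, ... ranks of parent.
    explicit SubCommPool (MPI_Comm parent)
        : m_parent(parent)
    {
        int nprocs, myrank;
        MPI_Comm_size(parent, &nprocs);
        MPI_Comm_rank(parent, &myrank);
        for (int n = 1; n <= nprocs; n *= 2) {
            MPI_Comm comm;
            int color = (myrank < n) ? 0 : MPI_UNDEFINED;
            MPI_Comm_split(parent, color, myrank, &comm);
            m_sizes.push_back(n);
            m_comms.push_back(comm);   // MPI_COMM_NULL on excluded ranks
        }
    }

    // Return the smallest prebuilt communicator with at least nranks ranks.
    // It may be larger than the DistributionMapping actually needs.
    MPI_Comm get (int nranks) const
    {
        for (std::size_t i = 0; i < m_sizes.size(); ++i) {
            if (m_sizes[i] >= nranks) { return m_comms[i]; }
        }
        return m_parent;   // fall back to the full communicator
    }

    ~SubCommPool ()
    {
        for (auto& c : m_comms) {
            if (c != MPI_COMM_NULL) { MPI_Comm_free(&c); }
        }
    }

private:
    MPI_Comm m_parent;
    std::vector<int>      m_sizes;
    std::vector<MPI_Comm> m_comms;
};
```

The pool would be built once (e.g., when the solver hierarchy is set up), so the per-level cost reduces to a table lookup rather than a new communicator construction at O(10k) ranks.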