
Change parallel python example

Open bkmgit opened this issue 4 years ago • 11 comments

I would like to change the parallel Python example to the code below:

import numpy as np
import sys
import datetime
from mpi4py import MPI


def inside_circle(total_count):
    # Sample total_count points in the unit square and count how many
    # land inside the quarter circle of radius 1.
    x = np.random.uniform(size=total_count)
    y = np.random.uniform(size=total_count)
    radii = np.sqrt(x * x + y * y)
    count = int(np.count_nonzero(radii <= 1.0))
    return count


def main():
    comm = MPI.COMM_WORLD
    n_cpus = comm.Get_size()
    rank = comm.Get_rank()
    n_samples = int(sys.argv[1])
    # Split the samples across ranks; rank 0 picks up the remainder.
    if rank == 0:
        my_samples = n_samples - (n_cpus - 1) * (n_samples // n_cpus)
    else:
        my_samples = n_samples // n_cpus

    # Synchronize so all ranks start the timed section together.
    comm.Barrier()
    start_time = datetime.datetime.now()
    my_counts = inside_circle(my_samples)
    # Sum the per-rank counts; with allreduce, every rank gets the total.
    counts = comm.allreduce(my_counts, op=MPI.SUM)
    comm.Barrier()
    end_time = datetime.datetime.now()
    elapsed_time = (end_time - start_time).total_seconds()
    my_pi = 4.0 * counts / n_samples
    if rank == 0:
        print("Pi: {}, time: {} s".format(my_pi, elapsed_time))


if __name__ == "__main__":
    main()
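
For reference, this might be run with something like the following (assuming the script is saved as pi-mpi.py, a hypothetical name):

mpiexec -n 4 python pi-mpi.py 100000000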

I would also like to avoid the large array allocated by np.random.uniform(size=total_count), since it is not strictly required. A plain Python loop is slow, and the novice lesson does better optimization, but I do not know whether a discussion of vectorization is needed in the intro lesson.

Comments appreciated.

reformatted by @tkphd using black, primarily for spacing

bkmgit avatar Apr 25 '21 15:04 bkmgit

I think @reid-a made some excellent decisions in the lesson code, from the perspective of teaching how to use clusters.

  • The algorithm is easy to explain, understand, and implement, even naïvely.
  • Vanilla Python loops are slow, but the NumPy library is vectorized C code and very efficient. This applies to ~both~ the Monte Carlo computation ~and the final vector sum (reduction)~; see the sketch after this list.
  • Vectorized NumPy expressions dramatically reduce the number of lines of code, improving readability and decreasing the time commitment of live-coding.
  • To get the vectorized performance, you need a vector.
  • The memory required for these arrays forces this off of local hardware, if there is to be any accuracy in the result.
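
A minimal sketch (not part of the lesson code, function names illustrative) of the trade-off described above: the pure-Python loop avoids the large arrays but is slow, while the vectorized NumPy version is fast but allocates memory proportional to the sample count.

import random
import numpy as np


def inside_circle_loop(total_count):
    # Pure-Python loop: one point at a time, no large arrays,
    # but slow because every iteration runs interpreted bytecode.
    count = 0
    for _ in range(total_count):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            count += 1
    return count


def inside_circle_numpy(total_count):
    # Vectorized: all points generated at once in compiled C code,
    # fast, but needs O(total_count) memory for the coordinate arrays.
    x = np.random.uniform(size=total_count)
    y = np.random.uniform(size=total_count)
    return int(np.count_nonzero(x * x + y * y <= 1.0))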

Changes could be made, and perhaps in HPC Python we should revisit this example to teach some better, higher-performance practices. That's a pretty good teaching pattern: introduce a bad way to do something, then incrementally improve it to show what's possible.

tkphd avatar Apr 25 '21 15:04 tkphd

MPI Allreduce is a blocking operation, meaning that it includes barriers internally. The two calls to Barrier() in the proposed code can be removed.

tkphd avatar Apr 25 '21 15:04 tkphd

Compiled code will generally vectorize such a loop, so significant memory is not needed; interpreted code has to rely on some sort of library. Is this a discussion worth having early on? I think the memory discussion can be postponed, or some reason for using large amounts of memory in interpreted code can be given.

bkmgit avatar Apr 25 '21 15:04 bkmgit

I think we should focus on the Serial vs. Parallel aspect, and stay away from discussing Interpreted vs Compiled languages, Vectorization, and memory footprint at this stage. The dedicated HPC Python lesson would be a much more appropriate place.

Compilers will unroll loops, but I'm not sure that's the same thing as vectorizing. Perhaps in the best case, with simple loop kernels and clear guards, the compiler will use a vector instruction, but again, there has to be a vector, and compilers tend to be conservative -- I think. I could be entirely mistaken, in which case enlightenment is welcome.

tkphd avatar Apr 25 '21 15:04 tkphd

Ok, the memory material can be moved to the HPC Python lesson. Most C, Fortran, C++ compilers will vectorize such loops with optimizations turned on.

bkmgit avatar Apr 25 '21 15:04 bkmgit

> MPI Allreduce is a blocking operation, meaning that it includes barriers internally. The two calls to Barrier() in the proposed code can be removed.

The first barrier is needed to ensure that all processes are measured from the same starting point. The second one can be removed thanks to the allreduce. Is it clearer to use an allreduce instead of a reduce?

bkmgit avatar Apr 25 '21 16:04 bkmgit

For clarity, unless a Barrier is absolutely necessary, I feel that both should be removed. Since the time-consuming parts are (1) the local computation and (2) the allreduce, and because MPI Reduce and Allreduce are both blocking (they synchronize the participating ranks much as a Barrier does), the difference in timing between the Barrier and non-Barrier versions ought to be negligible. We can certainly test to make sure.

Either Reduce or Allreduce is applicable here; perhaps, since not every rank needs the final answer, calling Reduce instead would be better. The function arguments are the same: mpi4py assumes rank 0 is the target if nothing is specified.
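
A hedged sketch of the Reduce variant, assuming the same variables as the proposed example. Note that mpi4py's lowercase reduce() returns the result only on the root rank and None elsewhere, so the final arithmetic has to be guarded:

counts = comm.reduce(my_counts, op=MPI.SUM)  # root defaults to rank 0
if rank == 0:
    my_pi = 4.0 * counts / n_samples
    print("Pi: {}".format(my_pi))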

tkphd avatar Apr 25 '21 17:04 tkphd

The first barrier is needed. We want to write portable code. MPI does not need to run on homogeneous processors, so one can come up with a situation where some processors finish much faster than others.

Can replace allreduce with reduce.

bkmgit avatar Apr 25 '21 17:04 bkmgit

Portable code is a worthwhile goal, but introducing too many corner-case details is going to overwhelm learners. The goal of this lesson is to introduce basic concepts in small, bite-sized increments. Just conceptualizing parallel Reduce is enough.

As written, the timer output reflects only rank 0. If we really want to know how long the computation takes, we should call Reduce on the timer data as well -- which would actually serve the purpose of the lesson, i.e., reinforcing that the variables defined in the function are local to each process, not shared. We could use either MPI_MAX to find the longest runtime, or MPI_SUM and divide by n_cpus to get the average; a sketch follows.
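
A sketch of that idea, assuming the variables from the proposed example (elapsed_time is measured independently on every rank):

max_time = comm.reduce(elapsed_time, op=MPI.MAX)  # slowest rank
sum_time = comm.reduce(elapsed_time, op=MPI.SUM)
if rank == 0:
    print("max: {} s, mean: {} s".format(max_time, sum_time / n_cpus))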

tkphd avatar Apr 25 '21 17:04 tkphd

MPI_MAX is good and portable. On a heterogeneous cluster, this would also enable a good discussion of load balancing.

bkmgit avatar Apr 25 '21 17:04 bkmgit

Such a conversation would be out of scope for this introductory lesson.

tkphd avatar Apr 25 '21 17:04 tkphd