
Simplifying the parallel application example

Open ocaisa opened this issue 3 years ago • 5 comments

Based on our discussion yesterday, I created a simplified version of an mpi4py example that exactly reproduces Amdahl's law. You could use it as a black box, or explain the code.
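For reference, a standalone sketch of the formula the script below encodes (same expression as the `amdahl_speed_up` calculation, no MPI needed), just to show what "exactly reproduces Amdahl's law" means numerically:

```python
def amdahl_speedup(p, n):
    """Expected speedup for a workload with parallel proportion p on n cores:
    S(n) = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# For p = 0.8 the speedup can never exceed 1 / (1 - p) = 5x, however many cores:
for n in (1, 2, 4, 8, 16):
    print(f"p=0.8, n={n:2d} -> speedup {amdahl_speedup(0.8, n):.2f}x")
```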

#!/usr/bin/env python
"""
Amdahl's law illustrator (with fake work)
"""

from mpi4py import MPI
import sys
import time
import argparse


def do_work(work_time=30, parallel_proportion=0.5, comm=MPI.COMM_WORLD):
    # How many MPI ranks (cores) are we?
    size = comm.Get_size()
    # Who am I in that set of ranks?
    rank = comm.Get_rank()
    # Where am I running?
    name = MPI.Get_processor_name()

    if rank == 0:
        # Use Amdahl's law to calculate the expected speedup for a given workload
        amdahl_speed_up = 1.0 / (
            (1.0 - parallel_proportion) + parallel_proportion / size
        )

        # Set the sleep times (which are used to fake the amount of work)
        serial_sleep_time = float(work_time) * (1.0 - parallel_proportion)
        parallel_sleep_time = (float(work_time) * parallel_proportion) / size

        sys.stdout.write(
            "Processors will do %s seconds of 'work', which should take %s seconds "
            "on %s cores with %s parallel proportion of the workload.\n"
            % (work_time, work_time / amdahl_speed_up, size, parallel_proportion)
        )

        sys.stdout.write(
            "Hello, World! I am process %d of %d on %s and I will do all the serial "
            "'work' for %s seconds.\n" % (rank, size, name, serial_sleep_time)
        )
        time.sleep(serial_sleep_time)
    else:
        parallel_sleep_time = None

    # Tell all processes how much work they need to do using 'bcast' to broadcast
    # (this also creates an implicit barrier, blocking processes until they receive
    # the value)
    parallel_sleep_time = comm.bcast(parallel_sleep_time, root=0)

    # This is where everyone pretends to do work (while really we are just sleeping)
    sys.stdout.write(
        "Hello, World! I am process %d of %d on %s and I will do parallel 'work' for "
        "%s seconds.\n" % (rank, size, name, parallel_sleep_time)
    )
    time.sleep(parallel_sleep_time)


# Only the root process handles the command line arguments
rank = MPI.COMM_WORLD.Get_rank()
if rank == 0:
    # Start a clock to measure total time
    start = time.time()
    # Initialize our argument parser
    parser = argparse.ArgumentParser()

    # Adding optional arguments
    parser.add_argument(
        "-p",
        "--parallel-proportion",
        nargs="?",
        const=0.5,
        type=float,
        default=0.5,
        help="Parallel proportion should be a float between 0 and 1",
    )
    parser.add_argument(
        "-w",
        "--work-seconds",
        nargs="?",
        const=30,
        type=int,
        default=30,
        help="Total seconds of workload, should be an integer greater than 0",
    )

    # Read arguments from command line
    args = parser.parse_args()

    if args.work_seconds <= 0:
        parser.print_help()
        MPI.COMM_WORLD.Abort(1)
        sys.exit(1)

    if args.parallel_proportion <= 0 or args.parallel_proportion > 1:
        parser.print_help()
        MPI.COMM_WORLD.Abort(1)
        sys.exit(1)

    do_work(work_time=args.work_seconds, parallel_proportion=args.parallel_proportion)
    end = time.time()
    sys.stdout.write(
        "Total execution time (according to rank 0): %s seconds\n" % (end - start)
    )
else:
    do_work()

This would give output like

[ocaisa@gpunode1 ~]$ time mpirun --oversubscribe -n 10 python amdahl.py -p 0.8 -w 20
Processors will do 20 seconds of 'work', which should take 5.6 seconds on 10 cores with 0.8 parallel proportion of the workload.
Hello, World! I am process 0 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do all the serial 'work' for 3.999999999999999 seconds.
Hello, World! I am process 1 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 0 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 2 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 7 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 3 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 4 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 5 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 6 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 8 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 9 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Total execution time (according to rank 0): 5.650804281234741 seconds
real    0m8.510s
user    0m4.143s
sys     0m7.540s

so it also shows that there is other overhead to consider. You can oversubscribe this to your heart's content, since it is not actually doing any calculations.
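Some rough bookkeeping on the sample run above makes that overhead visible (numbers copied from the transcript; `wall_clock` is the `real` line from `time`):

```python
# Amdahl prediction for 20 s of work, p = 0.8, 10 ranks: 20 / S(10) = 5.6 s
predicted = 20 / (1.0 / ((1.0 - 0.8) + 0.8 / 10))
rank0_measured = 5.650804281234741  # the script's own timer on rank 0
wall_clock = 8.510                  # `real` time reported by the shell

# Small gap: Python/sleep scheduling inside the job itself
print(f"in-job overhead:              {rank0_measured - predicted:.3f} s")
# Large gap: mpirun launch and teardown, invisible to rank 0's timer
print(f"MPI launch/teardown overhead: {wall_clock - rank0_measured:.3f} s")
```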

ocaisa avatar Jun 18 '21 11:06 ocaisa

Hmm, I think I can do better here: I can set the actual amount of serial time for the root (default is 0.5 * work_time) and distribute a trivial parallel sleep time (default would be (0.5 * work_time)/n_proc).

ocaisa avatar Jan 18 '22 09:01 ocaisa

Nice example. One could probably also use this for a parallel prefix scan or a parallel merge sort if one wants to demonstrate something other than sleeping. Embarrassingly parallel algorithms are less well suited for this.

bkmgit avatar Jan 18 '22 09:01 bkmgit

Sure, but my point here is not the algorithm at all (that would just be another thing to explain); it's to show that a workload can be made up of serial and parallel work, and to illustrate Amdahl's law as a result.

ocaisa avatar Jan 18 '22 10:01 ocaisa

I have updated the code in the initial comment to take command line args and also show the total time as seen by the root process (which can be compared to the measured system time to show there is also MPI setup overhead).

ocaisa avatar Jan 18 '22 12:01 ocaisa

Cool! Could you convert this into a PR? Thanks! :grin:

tkphd avatar Jan 18 '22 18:01 tkphd