Simplifying the parallel application example
Based on our discussion yesterday, I created a simplified version of an mpi4py example that exactly reproduces Amdahl's law. You could use it as a black box, or explain the code.
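As a quick reminder, Amdahl's law predicts the speedup $S$ for a workload of which a proportion $p$ can be parallelised over $N$ cores:

$$
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
$$

The script below simply turns the $(1 - p)$ share of the requested "work" into a serial sleep on rank 0 and the $p/N$ share into a sleep on every rank.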
```python
#!/usr/bin/env python
"""
Amdahl's law illustrator (with fake work)
"""
from mpi4py import MPI
import sys
import time
import argparse


def do_work(work_time=30, parallel_proportion=0.5, comm=MPI.COMM_WORLD):
    # How many MPI ranks (cores) are we?
    size = comm.Get_size()
    # Who am I in that set of ranks?
    rank = comm.Get_rank()
    # Where am I running?
    name = MPI.Get_processor_name()

    if rank == 0:
        # Use Amdahl's law to calculate the expected speedup for a given workload
        amdahl_speed_up = 1.0 / (
            (1.0 - parallel_proportion) + parallel_proportion / size
        )

        # Set the sleep times (which are used to fake the amount of work)
        serial_sleep_time = float(work_time) * (1.0 - parallel_proportion)
        parallel_sleep_time = (float(work_time) * parallel_proportion) / size

        sys.stdout.write(
            "Processors will do %s seconds of 'work', which should take %s seconds "
            "on %s cores with %s parallel proportion of the workload.\n"
            % (work_time, work_time / amdahl_speed_up, size, parallel_proportion)
        )
        sys.stdout.write(
            "Hello, World! I am process %d of %d on %s and I will do all the serial "
            "'work' for %s seconds.\n" % (rank, size, name, serial_sleep_time)
        )
        time.sleep(serial_sleep_time)
    else:
        parallel_sleep_time = None

    # Tell all processes how much work they need to do using 'bcast' to broadcast
    # (this also creates an implicit barrier, blocking processes until they receive
    # the value)
    parallel_sleep_time = comm.bcast(parallel_sleep_time, root=0)

    # This is where everyone pretends to do work (while really we are just sleeping)
    sys.stdout.write(
        "Hello, World! I am process %d of %d on %s and I will do parallel 'work' for "
        "%s seconds.\n" % (rank, size, name, parallel_sleep_time)
    )
    time.sleep(parallel_sleep_time)


# Only the root process handles the command line arguments
rank = MPI.COMM_WORLD.Get_rank()
if rank == 0:
    # Start a clock to measure total time
    start = time.time()

    # Initialize our argument parser
    parser = argparse.ArgumentParser()

    # Adding optional arguments
    parser.add_argument(
        "-p",
        "--parallel-proportion",
        nargs="?",
        const=0.5,
        type=float,
        default=0.5,
        help="Parallel proportion should be a float between 0 and 1",
    )
    parser.add_argument(
        "-w",
        "--work-seconds",
        nargs="?",
        const=30,
        type=int,
        default=30,
        help="Total seconds of workload, should be an integer greater than 0",
    )

    # Read arguments from command line
    args = parser.parse_args()

    if not args.work_seconds > 0:
        parser.print_help()
        MPI.COMM_WORLD.Abort(1)
        sys.exit(1)
    if args.parallel_proportion <= 0 or args.parallel_proportion > 1:
        parser.print_help()
        MPI.COMM_WORLD.Abort(1)
        sys.exit(1)

    do_work(work_time=args.work_seconds, parallel_proportion=args.parallel_proportion)
    end = time.time()
    sys.stdout.write(
        "Total execution time (according to rank 0): %s seconds\n" % (end - start)
    )
else:
    do_work()
```
This would give output like:
```
[ocaisa@gpunode1 ~]$ time mpirun --oversubscribe -n 10 python amdahl.py -p 0.8 -w 20
Processors will do 20 seconds of 'work', which should take 5.6 seconds on 10 cores with 0.8 parallel proportion of the workload.
Hello, World! I am process 0 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do all the serial 'work' for 3.999999999999999 seconds.
Hello, World! I am process 1 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 0 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 2 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 7 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 3 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 4 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 5 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 6 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 8 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Hello, World! I am process 9 of 10 on gpunode1.int.eessi-gpu.learnhpc.eu and I will do parallel 'work' for 1.6 seconds.
Total execution time (according to rank 0): 5.650804281234741 seconds

real 0m8.510s
user 0m4.143s
sys 0m7.540s
```
so it would also show that there is other overhead to consider. You can oversubscribe this to your heart's content since it is not actually doing any calculations.
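To make the diminishing returns explicit, here is a quick back-of-the-envelope snippet (purely illustrative, no MPI needed) that tabulates the runtime and speedup Amdahl's law predicts for the same workload (`-w 20 -p 0.8`) over a range of core counts:

```python
# Predicted runtime and speedup from Amdahl's law for 20 s of 'work', 80% parallel
work_time = 20.0
p = 0.8

for n in (1, 2, 4, 10, 100, 1000):
    speedup = 1.0 / ((1.0 - p) + p / n)
    runtime = work_time / speedup
    print("%4d cores: predicted runtime %6.2f s, speedup %5.2f x" % (n, runtime, speedup))
```

Even with 1000 cores the predicted runtime never drops below the 4 seconds of serial work, and the measured wall time always sits a bit above the prediction because of the MPI start-up and teardown overhead.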
Hmm, I think I can do better here: I can set the actual amount of serial time for the root (default is `0.5 * work_time`), and distribute a trivial parallel sleep time (default would be `(0.5 * work_time) / n_proc`).
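For the run shown above (`-w 20 -p 0.8` on 10 ranks) those sleep times work out to:

$$
t_\mathrm{serial} = (1 - 0.8) \times 20 = 4\ \mathrm{s}, \qquad
t_\mathrm{parallel} = \frac{0.8 \times 20}{10} = 1.6\ \mathrm{s}, \qquad
t_\mathrm{total} \approx 4 + 1.6 = 5.6\ \mathrm{s}
$$

which matches the 5.6 seconds the script predicts and the 5.65 seconds rank 0 actually measured.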
Nice example. One could probably also use a parallel prefix scan or a parallel merge sort algorithm if one wants to demonstrate something other than sleeping. Embarrassingly parallel algorithms are less well suited for this.
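For instance, a minimal sketch (illustrative only, not part of the example above) of an inclusive parallel prefix sum using mpi4py's `scan` could look like this:

```python
#!/usr/bin/env python
"""Minimal sketch: an inclusive parallel prefix sum with mpi4py's scan."""
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns its own small chunk of data (here just derived from the rank)
local_chunk = [rank * 4 + i for i in range(4)]

# Sum of this rank's chunk only
local_sum = sum(local_chunk)

# Inclusive prefix sum across ranks: rank i receives sum(chunk_0 ... chunk_i)
prefix_sum = comm.scan(local_sum, op=MPI.SUM)

print("Rank %d of %d: local sum = %d, prefix sum = %d"
      % (rank, size, local_sum, prefix_sum))
```

Run with e.g. `mpirun -n 4 python prefix_scan.py` (any file name will do); each rank prints its own partial result, so unlike the sleep-based example the "work" is real, if tiny.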
Sure, but my point here is not the algorithm at all (that would just be another thing that would have to be explained); it's to show that something can be made up of serial and parallel work, and to illustrate Amdahl's law as a result.
I have updated the code in the initial comment to take command line args and also show the total time as seen by the root process (which can be compared to the measured system time to show there is also MPI setup overhead).
Cool! Could you convert this into a PR? Thanks! :grin: