Add/remove OB1 and CUDA progress.
Provide support for dynamically adding and removing the progress function for OB1 and CUDA.
This fixes #4650.
Signed-off-by: George Bosilca [email protected]
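For context on the mechanism being discussed, here is a minimal conceptual sketch (not the code in this PR) of a reference-counted enable/disable pattern around `opal_progress_register()` / `opal_progress_unregister()`. The `mca_pml_ob1_enable_progress` name and the `progress_count >= 0` invariant come from the assertion reported below; the `count` parameter, the `mca_pml_ob1_disable_progress` name, and the plain (non-atomic) counter are assumptions made for illustration.

```c
/* Conceptual sketch only -- not the actual patch.  A reference count
 * tracks how many pending operations need the OB1 progress callback; the
 * callback is registered on the 0 -> 1 transition and unregistered on the
 * 1 -> 0 transition.  Real code would update the counter atomically
 * (e.g. with OPAL's atomic helpers); a plain int32_t is used for brevity. */
#include <assert.h>
#include <stdint.h>

#include "opal/runtime/opal_progress.h"

extern int mca_pml_ob1_progress(void);   /* the PML progress callback */

static int32_t progress_count = 0;       /* outstanding users of OB1 progress */

void mca_pml_ob1_enable_progress(int32_t count)
{
    assert(progress_count >= 0);         /* the invariant from the report below */
    progress_count += count;
    if (progress_count == count) {       /* first user: hook into opal_progress() */
        opal_progress_register(mca_pml_ob1_progress);
    }
}

void mca_pml_ob1_disable_progress(int32_t count)
{
    progress_count -= count;
    assert(progress_count >= 0);         /* more disables than enables is a bug */
    if (0 == progress_count) {           /* last user: stop polling OB1 */
        opal_progress_unregister(mca_pml_ob1_progress);
    }
}
```

With this shape, every enable must be matched by exactly one disable; the assertion in the report below fires when that pairing is broken.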
Hi George, apologies for the late response. I see an assertion failure with the patch; it looks like there are fewer increment operations than decrement operations, so `progress_count` goes negative. The command, output, and backtrace are below, and a standalone sketch of the suspected imbalance follows the trace.
mpirun -np 2 --hostfile /home/akvenkatesh/osu-micro-benchmarks/build-hsw/hostfile --mca btl vader,self,smcuda,openib /home/akvenkatesh/osu-micro-benchmarks/build-hsw/get_local_ompi_rank /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency D D
# OSU MPI-CUDA Latency Test v5.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 1.73
1 18.47
2 18.49
4 18.77
8 18.45
16 18.58
32 18.48
64 18.49
128 19.80
256 19.83
512 19.26
1024 20.38
2048 21.00
4096 22.81
8192 24.99
osu_latency: ../../../../../ompi/mca/pml/ob1/pml_ob1_progress.c:64: mca_pml_ob1_enable_progress: Assertion `progress_count >= 0' failed.
[hsw226:07774] *** Process received signal ***
[hsw226:07774] Signal: Aborted (6)
[hsw226:07774] Signal code: (-6)
[hsw226:07774] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaac28a370]
[hsw226:07774] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaac4cc1d7]
[hsw226:07774] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aaaac4cd8c8]
[hsw226:07774] [ 3] /lib64/libc.so.6(+0x2e146)[0x2aaaac4c5146]
[hsw226:07774] [ 4] /lib64/libc.so.6(+0x2e1f2)[0x2aaaac4c51f2]
[hsw226:07774] [ 5] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_enable_progress+0x54)[0x2aaaeb2e6322]
[hsw226:07774] [ 6] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_progress+0x1b5)[0x2aaaeb2e6523]
[hsw226:07774] [ 7] /home/akvenkatesh/openmpi/bosilca/build/lib/libopen-pal.so.0(opal_progress+0x30)[0x2aaaad20f4af]
[hsw226:07774] [ 8] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(+0xcb6a)[0x2aaaeb2deb6a]
[hsw226:07774] [ 9] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x362)[0x2aaaeb2dffe9]
[hsw226:07774] [10] /home/akvenkatesh/openmpi/bosilca/build/lib/libmpi.so.0(MPI_Recv+0x2c0)[0x2aaaabf8592e]
[hsw226:07774] [11] /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x40155d]
[hsw226:07774] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac4b8b35]
[hsw226:07774] [13] /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x401189]
[hsw226:07774] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 7774 on node hsw226 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
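To make the suspected imbalance concrete, here is a standalone, hypothetical illustration (not OMPI code) of how a single unmatched decrement drives such a counter negative and trips the same `progress_count >= 0` check on the next enable. The helper names mirror the sketch earlier in the thread and are not the actual OMPI functions.

```c
/* Hypothetical standalone repro of the imbalance described above. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static int32_t progress_count = 0;

static void enable_progress(int32_t count)
{
    assert(progress_count >= 0);   /* same check as at pml_ob1_progress.c:64 */
    progress_count += count;
}

static void disable_progress(int32_t count)
{
    progress_count -= count;       /* unguarded: can drop below zero */
}

int main(void)
{
    enable_progress(1);    /* a pending device transfer registers its interest */
    disable_progress(1);   /* the transfer completes */
    disable_progress(1);   /* a second, unmatched completion path decrements again */
    enable_progress(1);    /* progress_count is -1 here -> assertion fails */
    printf("not reached when assertions are enabled\n");
    return 0;
}
```

In the real run the extra decrement would presumably come from a CUDA/smcuda completion path that calls the disable side without a matching enable, which matches the "fewer increments than decrements" observation above.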
The IBM CI (GNU Compiler) build failed! Please review the log, linked below.
Gist: https://gist.github.com/6e3054f922a9b26ba73f13daa04ffa03
The IBM CI (XL Compiler) build failed! Please review the log, linked below.
Gist: https://gist.github.com/647cfc633c9868dc2823da2213996842
The IBM CI (PGI Compiler) build failed! Please review the log, linked below.
Gist: https://gist.github.com/157c0bbcb101530d4cc1fe17246e433e
bot:ibm:retest (CI script broke, should be fixed now)
@bosilca were you able to look at the failure?
Can one of the admins verify this patch?
@bosilca @jladd-mlnx - What's the fate of this PR?
@bosilca Is this something we could revive for HAN for v5.0.0?
Not for HAN. With the new accelerator framework I am not sure if this is still relevant.