Add/remove OB1 and CUDA progress.

Open bosilca opened this issue 6 years ago • 10 comments

Provide support for dynamically adding and removing the progress function for OB1 and CUDA.

This will provide a fix for #4650.
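
One way to picture "dynamically adding and removing" a progress function is a use counter that registers the callback on the first enable and unregisters it on the last disable. The sketch below is only an illustration under that assumption, not the code in this patch: `engine_register`, `engine_unregister`, `engine_progress`, `demo_enable_progress`, `demo_disable_progress` and the `users` counter are all hypothetical names, and the small callback table merely stands in for a progress engine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a progress engine's callback table. */
typedef int (*progress_fn_t)(void);

#define MAX_CALLBACKS 8
static progress_fn_t callbacks[MAX_CALLBACKS];
static size_t n_callbacks = 0;

/* Append a callback; returns 0 on success, -1 if the table is full. */
static int engine_register(progress_fn_t fn)
{
    if (n_callbacks == MAX_CALLBACKS) {
        return -1;
    }
    callbacks[n_callbacks++] = fn;
    return 0;
}

/* Remove a callback by swapping the last entry into its slot. */
static int engine_unregister(progress_fn_t fn)
{
    for (size_t i = 0; i < n_callbacks; i++) {
        if (callbacks[i] == fn) {
            callbacks[i] = callbacks[--n_callbacks];
            return 0;
        }
    }
    return -1;
}

/* Call every registered callback once and sum the reported events. */
static int engine_progress(void)
{
    int events = 0;
    for (size_t i = 0; i < n_callbacks; i++) {
        events += callbacks[i]();
    }
    return events;
}

/* Hypothetical component progress function; counts its invocations. */
static int demo_calls = 0;
static int demo_progress(void)
{
    demo_calls++;
    return 0;
}

/* Number of outstanding users that need demo_progress to run. */
static int users = 0;

/* First user registers the callback; later users only bump the count. */
static void demo_enable_progress(void)
{
    assert(users >= 0);
    if (users++ == 0) {
        (void)engine_register(demo_progress);
    }
}

/* Last user unregisters the callback; every disable must match an enable. */
static void demo_disable_progress(void)
{
    assert(users > 0);
    if (--users == 0) {
        (void)engine_unregister(demo_progress);
    }
}
```

With this shape the callback costs nothing in the progress loop while no user needs it, at the price of requiring every enable to be paired with exactly one disable.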

Signed-off-by: George Bosilca [email protected]

bosilca avatar Apr 14 '18 20:04 bosilca

Hi George, apologies for the late response. I see an assertion failure with the patch; it looks like there are fewer increment operations than decrement operations.

mpirun -np 2 --hostfile /home/akvenkatesh/osu-micro-benchmarks/build-hsw/hostfile --mca btl vader,self,smcuda,openib /home/akvenkatesh/osu-micro-benchmarks/build-hsw/get_local_ompi_rank /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency D D
# OSU MPI-CUDA Latency Test v5.3
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.73
1                      18.47
2                      18.49
4                      18.77
8                      18.45
16                     18.58
32                     18.48
64                     18.49
128                    19.80
256                    19.83
512                    19.26
1024                   20.38
2048                   21.00
4096                   22.81
8192                   24.99
osu_latency: ../../../../../ompi/mca/pml/ob1/pml_ob1_progress.c:64: mca_pml_ob1_enable_progress: Assertion `progress_count >= 0' failed.
[hsw226:07774] *** Process received signal ***
[hsw226:07774] Signal: Aborted (6)
[hsw226:07774] Signal code:  (-6)
[hsw226:07774] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaac28a370]
[hsw226:07774] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaac4cc1d7]
[hsw226:07774] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aaaac4cd8c8]
[hsw226:07774] [ 3] /lib64/libc.so.6(+0x2e146)[0x2aaaac4c5146]
[hsw226:07774] [ 4] /lib64/libc.so.6(+0x2e1f2)[0x2aaaac4c51f2]
[hsw226:07774] [ 5] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_enable_progress+0x54)[0x2aaaeb2e6322]
[hsw226:07774] [ 6] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_progress+0x1b5)[0x2aaaeb2e6523]
[hsw226:07774] [ 7] /home/akvenkatesh/openmpi/bosilca/build/lib/libopen-pal.so.0(opal_progress+0x30)[0x2aaaad20f4af]
[hsw226:07774] [ 8] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(+0xcb6a)[0x2aaaeb2deb6a]
[hsw226:07774] [ 9] /home/akvenkatesh/openmpi/bosilca/build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x362)[0x2aaaeb2dffe9]
[hsw226:07774] [10] /home/akvenkatesh/openmpi/bosilca/build/lib/libmpi.so.0(MPI_Recv+0x2c0)[0x2aaaabf8592e]
[hsw226:07774] [11] /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x40155d]
[hsw226:07774] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac4b8b35]
[hsw226:07774] [13] /home/akvenkatesh/osu-micro-benchmarks/build-hsw/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x401189]
[hsw226:07774] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 7774 on node hsw226 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
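
If the reading above is right (more decrements than increments), the assertion in the backtrace fires at a later enable call rather than at the decrement that actually went too far. A generic way to narrow this down is to check the counter at the decrement site and report the caller. The sketch below illustrates that idea only; `progress_users`, `checked_enable` and `checked_disable` are hypothetical names and are not part of the patch or of Open MPI.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counter of outstanding enable calls. */
static int progress_users = 0;

/* Number of disable calls that had no matching enable. */
static int unbalanced = 0;

/* Record one more user of the progress function. */
static void checked_enable(const char *who)
{
    (void)who; /* kept for symmetry with checked_disable */
    progress_users++;
}

/*
 * Drop one user. An unmatched call is reported immediately, together
 * with the caller's label, and leaves the counter at zero instead of
 * letting it go negative. Returns 0 on success, -1 on an unmatched call.
 */
static int checked_disable(const char *who)
{
    if (progress_users == 0) {
        fprintf(stderr, "unbalanced disable from %s\n", who);
        unbalanced++;
        return -1;
    }
    progress_users--;
    return 0;
}
```

Passing a label such as the calling function's name makes the first unmatched disable visible in the output, which is usually closer to the root cause than the point where a negative counter is eventually noticed.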

Akshay-Venkatesh avatar Apr 27 '18 21:04 Akshay-Venkatesh

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/6e3054f922a9b26ba73f13daa04ffa03

ibm-ompi avatar Oct 18 '18 17:10 ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/647cfc633c9868dc2823da2213996842

ibm-ompi avatar Oct 18 '18 17:10 ibm-ompi

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/157c0bbcb101530d4cc1fe17246e433e

ibm-ompi avatar Oct 18 '18 18:10 ibm-ompi

bot:ibm:retest (CI script broke, should be fixed now)

jjhursey avatar Oct 18 '18 18:10 jjhursey

@bosilca were you able to look at the failure?

awlauria avatar Mar 19 '20 14:03 awlauria

Can one of the admins verify this patch?

lanl-ompi avatar Oct 25 '20 21:10 lanl-ompi

@bosilca @jladd-mlnx - What's the fate of this PR?

gpaulsen avatar Mar 02 '21 15:03 gpaulsen

@bosilca Is this something we could revive for HAN for v5.0.0?

gpaulsen avatar Aug 30 '22 19:08 gpaulsen

Not for HAN. With the new accelerator framework, I am not sure this is still relevant.

bosilca avatar Aug 30 '22 22:08 bosilca