openmpi pml ucx cannot be selected when linking to cuda object file
Describe the bug
The full error message is:
```
$ # UCX at master
$ mpirun -np 2 --npernode 2 --mca btl ^openib,smcuda --mca pml ucx --mca pml_ucx_devices any --mca pml_ucx_tls any -x LD_LIBRARY_PATH $PWD/a.out
[1660949109.951339] [prm-dgx-10:9574 :0] select.c:634 UCX ERROR no copy across memory types transport to prm-dgx-10:9574: self/memory - Destination is unreachable, tcp/lo - no put short, tcp/ib2 - no put short, tcp/ib0 - no put short, tcp/enp1s0f0 - no put short, tcp/ib3 - no put short, tcp/ib1 - no put short, sysv/memory - no memory registration, posix/memory - no memory
[1660949109.951338] [prm-dgx-10:9575 :0] select.c:634 UCX ERROR no copy across memory types transport to prm-dgx-10:9575: self/memory - Destination is unreachable, tcp/lo - no put short, tcp/ib2 - no put short, tcp/ib0 - no put short, tcp/enp1s0f0 - no put short, tcp/ib3 - no put short, tcp/ib1 - no put short, sysv/memory - no memory registration, posix/memory - no memory
[prm-dgx-10:09575] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:309 Error: Failed to create UCP worker
[prm-dgx-10:09574] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:309 Error: Failed to create UCP worker
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
  Host:      prm-dgx-10
  Framework: pml
--------------------------------------------------------------------------
[prm-dgx-10:09575] PML ucx cannot be selected
[prm-dgx-10:09574] PML ucx cannot be selected
[prm-dgx-10:09570] 1 more process has sent help message help-mca-base.txt / find-available:none found
[prm-dgx-10:09570] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
```
An older UCX commit (the one where the minor version was bumped to 1.14) doesn't have this issue:
```
$ # UCX at commit 313ba9f7fded9688096660245ae1c574ab6bdd73
$ mpirun -np 2 --npernode 2 --mca btl ^openib,smcuda --mca pml ucx --mca pml_ucx_devices any --mca pml_ucx_tls any -x LD_LIBRARY_PATH $PWD/a.out
rank 0 of 2
rank 1 of 2
```
`UCX_LOG_LEVEL=TRACE` shows this:
```
cuda_copy/cuda : not suitable for copy across memory types, no host
```
Steps to Reproduce
- Command line
```cuda
// multiply.cu
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void __multiply__(float *a, float *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    b[i] *= a[i];
}

extern "C" void launch_multiply(float *a, float *b)
{
    if (a == NULL || b == NULL) return;
    __multiply__<<<1, 1, 0>>>(a, b);
    return;
}
```
```c
// main.c
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

void launch_multiply(float *a, float *b);

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    fprintf(stdout, "rank %d of %d\n", rank, nprocs);
    launch_multiply(NULL, NULL);
    MPI_Finalize();
    return 0;
}
```
- Compilation and run
```shell
$ nvcc -c multiply.cu -o multiply.o -L$CUDA_HOME/lib64 -lcuda -lcudart -I$CUDA_HOME/include -I$UCX_HOME/include -L$UCX_HOME/lib -luct -lucp -lucs -lucm
$ mpicc -c main.c -o main.o -L$MPI_HOME/lib -lmpi -I$MPI_HOME/include
$ mpicc main.o multiply.o -L$CUDA_HOME/lib64 -lcuda -lcudart -I$CUDA_HOME/include -lstdc++
$ mpirun -np 2 --npernode 2 --mca btl ^openib,smcuda --mca pml ucx --mca pml_ucx_devices any --mca pml_ucx_tls any -x LD_LIBRARY_PATH $PWD/a.out
```
Setup and versions
- DGX-1v with cuda 11.7
Additional information (depending on the issue)
- OpenMPI version 4.1.4
@yosefe any thoughts on this?
@Akshay-Venkatesh can you bisect the commit that caused it?
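A sketch of how such a bisection can be automated with `git bisect run`, shown on a throwaway repo so it's self-contained (in practice the test script would rebuild UCX and run the `mpirun` reproducer, exiting non-zero on the PML error; the repo and commits below are stand-ins):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect

# Four stand-in commits; the "regression" lands in the third one.
echo ok > state && git add state && git commit -qm good1
git commit -qm good2 --allow-empty
echo broken > state && git commit -qam bad1
git commit -qm bad2 --allow-empty

# Mark HEAD bad and the oldest commit good, then let bisect drive the
# test command; the command's exit code classifies each commit.
git bisect start HEAD HEAD~3 >/dev/null
git bisect run sh -c 'grep -q ok state'
```

The last command reports `bad1` as the first bad commit; for UCX, the test command would be a script that builds the tree and runs the reproducer above.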
@yosefe Seems like this 4c55b2c60288d58440b312b072329cc7633dd10e introduced the issue. cc @rakhmets
Ok, I will take a look.
@yosefe @rakhmets Based on the offline discussion, I tried to catch a potential error returned by the kernel launch with the following changes in multiply.cu:
```cuda
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void __multiply__(float *a, float *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    b[i] *= a[i];
}

extern "C" void launch_multiply(float *a, float *b)
{
    cudaError_t cuda_err;

    if (a == NULL || b == NULL) return;
    __multiply__<<<1, 1, 0>>>(a, b);
    cuda_err = cudaGetLastError();
    if (cudaSuccess != cuda_err) {
        fprintf(stderr, "cuda err: %s\n", cudaGetErrorString(cuda_err));
        exit(-1);
    } else {
        printf("no cuda errors in kernel launch\n");
    }
    return;
}
```
But, not surprisingly, the original error still shows up, because it happens in the worker-create phase, before any kernel is launched. (In fact, the kernel can never be launched in my test: main calls launch_multiply(NULL, NULL), which returns immediately.)
As I mentioned offline, the error goes away when I add `-arch sm_70` to the nvcc compiler options:
```shell
$ nvcc -arch sm_70 -c multiply.cu -o multiply.o -L$CUDA_HOME/lib64 -lcuda -lcudart -I$CUDA_HOME/include
$ mpicc -c main.c -o main.o -L$MPI_HOME/lib -lmpi -I$MPI_HOME/include
$ mpicc main.o multiply.o -L$CUDA_HOME/lib64 -lcuda -lcudart -I$CUDA_HOME/include -lstdc++
```
@Akshay-Venkatesh to summarize our offline discussion: let's PR an entry about this issue to https://github.com/openucx/ucx/blob/master/docs/source/faq.md#working-with-gpu and/or README.md, and in the same PR, the cuda_copy transport should print the PTX error message at log level error, even if hide_errors is set.
@yosefe FYI, I created https://github.com/openucx/ucx/pull/8569 to address the issue.