openmpi pml ucx cannot be selected when linking to cuda object file
Describe the bug
The full error message is:
```
$ # UCX at master
$ mpirun -np 2 --npernode 2 --mca btl ^openib,smcuda --mca pml ucx --mca pml_ucx_devices any --mca pml_ucx_tls any -x LD_LIBRARY_PATH $PWD/a.out
[1660949109.951339] [prm-dgx-10:9574 :0] select.c:634 UCX ERROR no copy across memory types transport to prm-dgx-10:9574: self/memory - Destination is unreachable, tcp/lo - no put short, tcp/ib2 - no put short, tcp/ib0 - no put short, tcp/enp1s0f0 - no put short, tcp/ib3 - no put short, tcp/ib1 - no put short, sysv/memory - no memory registration, posix/memory - no memory
[1660949109.951338] [prm-dgx-10:9575 :0] select.c:634 UCX ERROR no copy across memory types transport to prm-dgx-10:9575: self/memory - Destination is unreachable, tcp/lo - no put short, tcp/ib2 - no put short, tcp/ib0 - no put short, tcp/enp1s0f0 - no put short, tcp/ib3 - no put short, tcp/ib1 - no put short, sysv/memory - no memory registration, posix/memory - no memory
[prm-dgx-10:09575] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:309 Error: Failed to create UCP worker
[prm-dgx-10:09574] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:309 Error: Failed to create UCP worker
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
  Host:      prm-dgx-10
  Framework: pml
--------------------------------------------------------------------------
[prm-dgx-10:09575] PML ucx cannot be selected
[prm-dgx-10:09574] PML ucx cannot be selected
[prm-dgx-10:09570] 1 more process has sent help message help-mca-base.txt / find-available:none found
[prm-dgx-10:09570] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
```
An older UCX commit (the one where the minor version was bumped to 1.14) doesn't have this issue:
```
$ # UCX at commit 313ba9f7fded9688096660245ae1c574ab6bdd73
$ mpirun -np 2 --npernode 2 --mca btl ^openib,smcuda --mca pml ucx --mca pml_ucx_devices any --mca pml_ucx_tls any -x LD_LIBRARY_PATH $PWD/a.out
rank 0 of 2
rank 1 of 2
```
`UCX_LOG_LEVEL=TRACE` shows this:
```
cuda_copy/cuda : not suitable for copy across memory types, no host
```
Steps to Reproduce
- Command line
```cuda
// multiply.cu
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void __multiply__(float *a, float *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    b[i] *= a[i];
}

extern "C" void launch_multiply(float *a, float *b)
{
    if (a == NULL || b == NULL) return;
    __multiply__<<<1, 1, 0>>>(a, b);
    return;
}
```
```c
// main.c
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

void launch_multiply(float *a, float *b);

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    fprintf(stdout, "rank %d of %d\n", rank, nprocs);
    launch_multiply(NULL, NULL);
    MPI_Finalize();
    return 0;
}
```
- Compilation and run
```shell
$ nvcc -c multiply.cu -o multiply.o -L$CUDA_HOME/lib64 -lcuda -lcudart -I$CUDA_HOME/include -I$UCX_HOME/include -L$UCX_HOME/lib -luct -lucp -lucs -lucm
$ mpicc -c main.c -o main.o -L$MPI_HOME/lib -lmpi -I$MPI_HOME/include
$ mpicc main.o multiply.o -L$CUDA_HOME/lib64 -lcuda -lcudart -I$CUDA_HOME/include -lstdc++
$ mpirun -np 2 --npernode 2 --mca btl ^openib,smcuda --mca pml ucx --mca pml_ucx_devices any --mca pml_ucx_tls any -x LD_LIBRARY_PATH $PWD/a.out
```
Setup and versions
- DGX-1v with cuda 11.7
Additional information (depending on the issue)
- OpenMPI version 4.1.4
@yosefe any thoughts on this?
@Akshay-Venkatesh can you bisect the commit that caused it?
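A sketch of how such a bisection can be automated with `git bisect run`, shown on a throwaway repo so it's self-contained (in practice the test script would rebuild UCX and run the `mpirun` reproducer, exiting non-zero on the PML error; the repo and commits below are stand-ins):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect

# Four stand-in commits; the "regression" lands in the third one.
echo ok > state && git add state && git commit -qm good1
git commit -qm good2 --allow-empty
echo broken > state && git commit -qam bad1
git commit -qm bad2 --allow-empty

# Mark HEAD bad and the oldest commit good, then let bisect drive the
# test command; the command's exit code classifies each commit.
git bisect start HEAD HEAD~3 >/dev/null
git bisect run sh -c 'grep -q ok state'
```

The last command reports `bad1` as the first bad commit; for UCX, the test command would be a script that builds the tree and runs the reproducer above.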
@yosefe Seems like this 4c55b2c60288d58440b312b072329cc7633dd10e introduced the issue. cc @rakhmets
Ok, I will take a look.
@yosefe @rakhmets Based on the offline discussion, I tried to catch a potential error returned by the kernel launch with the following changes in multiply.cu:
```cuda
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void __multiply__(float *a, float *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    b[i] *= a[i];
}

extern "C" void launch_multiply(float *a, float *b)
{
    cudaError_t cuda_err;

    if (a == NULL || b == NULL) return;
    __multiply__<<<1, 1, 0>>>(a, b);
    cuda_err = cudaGetLastError();
    if (cudaSuccess != cuda_err) {
        fprintf(stderr, "cuda err: %s\n", cudaGetErrorString(cuda_err));
        exit(-1);
    } else {
        printf("no cuda errors in kernel launch\n");
    }
    return;
}
```
But, not surprisingly, the original error still shows up, because it happens in the worker-create phase, before any kernel is launched. (In fact, the kernel can never be launched in my test: main calls launch_multiply(NULL, NULL), which returns immediately.)
As I mentioned offline, the error goes away when I add `-arch sm_70` to the nvcc compiler options:
```shell
$ nvcc -arch sm_70 -c multiply.cu -o multiply.o -L$CUDA_HOME/lib64 -lcuda -lcudart -I$CUDA_HOME/include
$ mpicc -c main.c -o main.o -L$MPI_HOME/lib -lmpi -I$MPI_HOME/include
$ mpicc main.o multiply.o -L$CUDA_HOME/lib64 -lcuda -lcudart -I$CUDA_HOME/include -lstdc++
```
@Akshay-Venkatesh to summarize our offline discussion: let's PR an entry about this issue to https://github.com/openucx/ucx/blob/master/docs/source/faq.md#working-with-gpu and/or README.md, and in the same PR, the cuda_copy transport should print the PTX error message at log level error, even if hide_errors is set.
@yosefe FYI, I created https://github.com/openucx/ucx/pull/8569 to address the issue.