cuda-profiler
MPI annotation option does not output any MPI information
Dear Nvprof developers:
I want to use nvprof to profile my CUDA+MPI application. However, a small test shows that the option --annotate-mpi openmpi does not produce any information about MPI calls, contrary to what the nvprof documentation describes. The details of the test are as follows:
Sample Test:
From Link: http://geco.mines.edu/tesla/cuda_tutorial_mio/
Source Files: mpi_hello_gpu.cu, vecadd.cu
OpenMPI Version: 4.0.2
CUDA Version: 10.1
Command: $ mpirun -np 2 nvprof --annotate-mpi openmpi ./mpi_cuda
Output (using 2 MPI processes):
rank 0 of 2 on p3dev02 received bcastme[3]=3 [gpu 0]
rank 1 of 2 on p3dev02 received bcastme[3]=3 [gpu 1]
==70253== NVPROF is profiling process 70253, command: ./mpi_cuda
==70254== NVPROF is profiling process 70254, command: ./mpi_cuda
rank 0: cudaGetDevice()=0
rank 1: cudaGetDevice()=1
rank 1: C[0]=0.000000 ranksum= 1
==70253== Profiling application: ./mpi_cuda
==70253== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   62.58%  3.1040us      2  1.5520us  1.3440us  1.7600us  [CUDA memcpy HtoD]
                   37.42%  1.8560us      1  1.8560us  1.8560us  1.8560us  [CUDA memcpy DtoH]
      API calls:   86.74%  352.44ms      3  117.48ms  10.267us  352.42ms  cudaMalloc
                    5.39%  21.910ms    582  37.645us     258ns  2.0794ms  cuDeviceGetAttribute
                    4.75%  19.303ms  50000     386ns     303ns  102.73us  cudaLaunchKernel
                    2.07%  8.3917ms      6  1.3986ms  1.1406ms  1.4661ms  cuDeviceTotalMem
                    0.68%  2.7607ms      1  2.7607ms  2.7607ms  2.7607ms  cudaGetDeviceProperties
                    0.34%  1.3713ms      6  228.55us  215.41us  247.59us  cuDeviceGetName
                    0.02%  66.319us      3  22.106us  14.092us  30.931us  cudaMemcpy
                    0.01%  20.708us      3  6.9020us  1.8690us  16.755us  cudaFree
                    0.00%  12.278us      6  2.0460us  1.3700us  4.3850us  cuDeviceGetPCIBusId
                    0.00%  7.5770us     12     631ns     375ns     973ns  cuDeviceGet
                    0.00%  6.6190us      1  6.6190us  6.6190us  6.6190us  cudaSetDevice
                    0.00%  6.2070us      4  1.5510us     867ns  2.3670us  cuPointerGetAttributes
                    0.00%  2.3390us      6     389ns     354ns     461ns  cuDeviceGetUuid
                    0.00%  1.8280us      3     609ns     437ns     780ns  cuDeviceGetCount
                    0.00%  1.5210us      1  1.5210us  1.5210us  1.5210us  cudaGetDevice
                    0.00%  1.2300us      1  1.2300us  1.2300us  1.2300us  cudaGetDeviceCount
==70254== Profiling application: ./mpi_cuda
==70254== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  179.83ms  50000  3.5960us  3.5510us  4.0640us  vecAdd(float*, float*, float*)
                    0.00%  3.0400us      2  1.5200us  1.3440us  1.6960us  [CUDA memcpy HtoD]
                    0.00%  2.0480us      1  2.0480us  2.0480us  2.0480us  [CUDA memcpy DtoH]
      API calls:   68.49%  884.64ms  50000  17.692us  16.647us  1.4335ms  cudaLaunchKernel
                   28.85%  372.61ms      3  124.20ms  15.212us  372.57ms  cudaMalloc
                    1.55%  20.003ms    582  34.368us     453ns  1.2518ms  cuDeviceGetAttribute
                    0.76%  9.7675ms      6  1.6279ms  1.6077ms  1.6602ms  cuDeviceTotalMem
                    0.25%  3.2029ms      1  3.2029ms  3.2029ms  3.2029ms  cudaGetDeviceProperties
                    0.10%  1.2356ms      6  205.93us  135.78us  224.53us  cuDeviceGetName
                    0.01%  103.42us      3  34.473us  19.464us  60.273us  cudaMemcpy
                    0.00%  60.895us      3  20.298us  4.2420us  51.665us  cudaFree
                    0.00%  16.364us      4  4.0910us  2.0370us  9.1220us  cuPointerGetAttributes
                    0.00%  14.154us      6  2.3590us  1.9510us  3.1620us  cuDeviceGetPCIBusId
                    0.00%  11.338us     12     944ns     580ns  1.5080us  cuDeviceGet
                    0.00%  7.3840us      1  7.3840us  7.3840us  7.3840us  cudaSetDevice
                    0.00%  3.8410us      6     640ns     592ns     673ns  cuDeviceGetUuid
                    0.00%  2.7020us      3     900ns     699ns  1.0970us  cuDeviceGetCount
                    0.00%  1.9360us      1  1.9360us  1.9360us  1.9360us  cudaGetDevice
                    0.00%  1.2750us      1  1.2750us  1.2750us  1.2750us  cudaGetDeviceCount
Hope you can reproduce the issue.
Best, Shelton