dbcsr
Hit acc_devmem_setzero: failed in dbcsr/acc/dbcsr_acc_devmem.F:445
Describe the bug
Hit acc_devmem_setzero: failed in dbcsr/acc/dbcsr_acc_devmem.F:445
To Reproduce
Steps to reproduce the behavior:
git clone --recursive https://github.com/cp2k/dbcsr.git (commit id: f35f901e4460980aa06757294463a1e6308f8dc9)
cd dbcsr; mkdir build; cd build
cmake -DTEST_MPI_RANKS=2 -DTEST_OMP_THREADS=2 -DUSE_CUDA=ON -DUSE_CUBLAS=ON -DWITH_GPU=P100 -DCMAKE_BUILD_TYPE=Debug ../
make -j16
OMP_NUM_THREADS=2 mpirun -np 2 ./tests/dbcsr_unittest2
test_name large_blocks_1
numthreads 2
numnodes 2
matrix_sizes 500 500 500
sparsities 0.50000000000000000 0.50000000000000000 0.50000000000000000
alpha (1.0000000000000000,0.0000000000000000)
beta (0.0000000000000000,0.0000000000000000)
limits 1 500 1 500 1 500
retain_sparsity F
bs_m 1 100
bs_n 1 100
bs_k 1 100
CUDA Error: invalid argument
*******************************************************************************
* ___ *
* / \ *
* [ABORT] *
* \___/ acc_devmem_setzero: failed *
* | *
* O/| *
* /| | *
* / \ dbcsr/acc/dbcsr_acc_devmem.F:445 *
*******************************************************************************
===== Routine Calling Stack =====
5 multiply_cannon
4 dbcsr_multiply_generic
3 test_multiply
2 dbcsr_test_multiplies
1 dbcsr_unittest
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
It looks like dbcsr allocates the memory on device 1 but launches the devmem operation with device 0 active.
Expected behavior
The CUDA fault should not occur.
Environment:
- Operating system & version: Ubuntu 18.04
- Compiler vendor & version: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
- Build environment (make or cmake): cmake
- Configuration of DBCSR (either the cmake flags or the Makefile.inc): cmake -DTEST_MPI_RANKS=2 -DTEST_OMP_THREADS=2 -DUSE_CUDA=ON -DUSE_CUBLAS=ON -DWITH_GPU=P100 -DCMAKE_BUILD_TYPE=Debug ../
- MPI implementation and version: openmpi 2.1.1
- If CUDA is being used: CUDA version and GPU architecture: cuda-10.1, P100
- BLAS/LAPACK implementation and version: cublas
- If applicable: Runtime information (how many nodes, type of nodes, ...): run in one node
Thanks, Vitesse.
Could you check by compiling with -DUSE_CUBLAS=OFF?
It seems to be an error due to https://github.com/cp2k/dbcsr/issues/229...
Dear Alazzaro,
- I cannot fix the issue by changing the code you mention in #229:
CALL acc_devmem_setzero_bytes(area%d%acc_devmem, s*l+1, s*u, area%d%memory_type%acc_stream)
- It still hits the failure when I disable cuBLAS:
cmake -DTEST_MPI_RANKS=2 -DTEST_OMP_THREADS=2 -DUSE_CUDA=ON -DUSE_CUBLAS=OFF -DWITH_GPU=P100 -DCMAKE_BUILD_TYPE=Debug ../
OMP_NUM_THREADS=2 mpirun -np 2 ./tests/dbcsr_unittest2
- The issue can be reproduced with version v2.0.0-alpha1.
Thanks, Vitesse.
OK, thanks for the clarification, we are trying to reproduce the problem. Concerning this part:
It looks like dbcsr allocates the memory on device 1 but launches the devmem operation with device 0 active.
Just to be sure, are you running on a multi-gpu system?
Yes, one node with 2 GPU cards.
Might be a duplicate of #205.
@vitesselin
Unfortunately, we don't have a multi-gpu system to try on. We would like to ask if you could try the following patch, which replaces the lines
https://github.com/cp2k/dbcsr/blob/f35f901e4460980aa06757294463a1e6308f8dc9/src/acc/dbcsr_acc_devmem.F#L435-L438
with
offset = 0
length = this%size_in_bytes
IF (PRESENT(first_byte)) THEN
   offset = first_byte - 1
   length = length - first_byte
ENDIF
IF (PRESENT(last_byte)) THEN
   length = last_byte
   IF (PRESENT(first_byte)) length = length - first_byte
ENDIF
and recompile DBCSR.
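For context, the offset and length computed above are presumably handed to a byte-wise memset on the device buffer. A minimal sketch of what such a call could look like, in CUDA C (the function name and error handling are illustrative and not DBCSR's actual backend; only cudaMemsetAsync and cudaGetErrorString are real CUDA API calls):
#include <cuda_runtime.h>
#include <stdio.h>

/* Hypothetical sketch: zero `length` bytes starting at byte `offset` of a device
 * allocation, asynchronously on `stream`. An "invalid argument" error here can
 * occur when the pointer or stream does not belong to the currently set device. */
static int devmem_setzero_sketch(void *dev_base, size_t offset, size_t length,
                                 cudaStream_t stream)
{
  char *ptr = (char *)dev_base + offset;
  cudaError_t err = cudaMemsetAsync(ptr, 0, length, stream);
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(err));
    return -1; /* would surface as "acc_devmem_setzero: failed" in the Fortran layer */
  }
  return 0;
}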
@alazzaro It still hits the problem and I also print out the length and offset for it.
> test_name large_blocks_1
> numthreads 2
> numnodes 2
> matrix_sizes 500 500 500
> sparsities 0.50000000000000000 0.50000000000000000 0.50000000000000000
> alpha (1.0000000000000000,0.0000000000000000)
> beta (0.0000000000000000,0.0000000000000000)
> limits 1 500 1 500 1 500
> retain_sparsity F
> bs_m 1 100
> bs_n 1 100
> bs_k 1 100
> length= 300000 offset= 0
> length= 432000 offset= 0
> CUDA Error: invalid argument
> length= 300000 offset= 0
> length= 432000 offset= 0
>
> *******************************************************************************
> * ___ *
> * / \ *
> * [ABORT] *
> * \___/ acc_devmem_setzero: failed *
> * | *
> * O/| *
> * /| | *
> * / \ dbcsr/acc/dbcsr_acc_devmem.F:452 *
> *******************************************************************************
>
>
> ===== Routine Calling Stack =====
>
> 5 multiply_cannon
> 4 dbcsr_multiply_generic
> 3 test_multiply
> 2 dbcsr_test_multiplies
> 1 dbcsr_unittest
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>
@oschuett Yes, it looks similar to #205. I observed that DBCSR only does launchkernel on the first device even if I have two or more. Furthermore, DBCSR allocates device memory on each device, not only on the first one. The issue happens when GPU0 wants to use memory that was allocated on GPU1,2,3... If this is the scenario DBCSR intends, DBCSR should enable peer-to-peer access rights between the GPUs.
One more question: what configuration do CP2K and DBCSR use for the official build testing? A single node or multiple nodes? A single GPU per node or multiple GPUs per node?
I also print the current device whenever DBCSR allocates device memory. Configuration: one node with 2 GPU cards, OMP=2, MPI=2, so we have two processes and four threads. Here is which device each allocation goes to (the columns are pid, tid, and the current device):
44734 44734 D VIT myDevice=0
44734 44742 D VIT myDevice=0
44735 44735 D VIT myDevice=0
44735 44744 D VIT myDevice=1
The related code.
diff --git a/src/acc/cuda/acc_cuda_mem.cu b/src/acc/cuda/acc_cuda_mem.cu
index 9db76ae3..471e4383 100644
--- a/src/acc/cuda/acc_cuda_mem.cu
+++ b/src/acc/cuda/acc_cuda_mem.cu
@@ -12,6 +12,8 @@
#include <math.h>
#include "acc_cuda_error.h"
#include "../include/acc.h"
+#include <unistd.h>
+#include <sys/syscall.h>
static const int verbose_print = 0;
@@ -19,6 +21,10 @@ static const int verbose_print = 0;
/****************************************************************************/
extern "C" int acc_dev_mem_allocate(void **dev_mem, size_t n){
cudaError_t cErr;
+ int myDevice;
+
+ cErr = cudaGetDevice (&myDevice);
+ printf ("%d %d D VIT myDevice=%d\n", getpid(), syscall(SYS_gettid), myDevice);
A few clarifications:
- we don't test multi-gpu configurations (see point 3 too)
- I don't think #205 is related to this bug in the first place. #205 is about using the library in combination with other libraries that use the GPUs, for instance in CP2K when it uses Sirius. In your case, you are running the library alone with a unit test
- the way DBCSR uses multiple GPUs is via multiple ranks, where each rank is attached to a GPU in a round-robin fashion (see the sketch below). So, in your case, you have 2 ranks: rank0<->gpu0 and rank1<->gpu1. For the particular unittest2, the code which does this is
https://github.com/cp2k/dbcsr/blob/f35f901e4460980aa06757294463a1e6308f8dc9/tests/dbcsr_unittest2.F#L69-L70
This is done at the beginning of the execution. There are no other calls in DBCSR where we change the device. So, unless we lose the device for that rank, the GPU-to-rank matching should be preserved... But now you are showing that:
44734 44734 D VIT myDevice=0
44734 44742 D VIT myDevice=0
44735 44735 D VIT myDevice=0
44735 44744 D VIT myDevice=1
This is definitely wrong, one of the ranks (44735) is trying to access both devices! I wonder how it can happen... I understand we need to reset the GPU before each multiplication, but it is still not clear to me how and where things are getting wrong in unittest2...
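To make the round-robin rank<->GPU mapping above concrete, here is a minimal sketch in CUDA C (illustrative only, not the actual Fortran in dbcsr_unittest2.F; the plain MPI_COMM_WORLD rank is used since the report runs on a single node, and MPI is assumed to be initialized already):
#include <cuda_runtime.h>
#include <mpi.h>

/* Each MPI rank picks one device at startup (rank0<->gpu0, rank1<->gpu1, ...)
 * and is expected to keep it for the rest of the run. Note that cudaSetDevice
 * only affects the calling host thread; threads spawned later do not inherit it. */
static void bind_rank_to_gpu(void)
{
  int rank = 0, ndev = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaGetDeviceCount(&ndev);
  if (ndev > 0)
    cudaSetDevice(rank % ndev);
}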
I think I got the answer to the problem. Sorry for my ignorance of CUDA threading:
https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
@vitesselin I assume that with 1 thread the code works fine?
@alazzaro
I assume that with 1 thread the code works fine?
Yes, the issue is always found in process 2 with thread id 1. For the CUDA threading problem, I have no idea where I should add the SetDevice call in the code.
By the way, I also tried this in CP2K with 2 GPUs in one node. There are a lot of similar errors in the regression tests, so I think it is similar to #205. I can work around this by adding peer-to-peer access rights for each GPU.
Given that it is suggested to set the GPU even per thread (my understanding was that each thread should keep the same GPU as its rank; it turns out I was wrong), I think we need to set the GPU before any CUDA call... At this point, I wonder if this is something new... we never had a similar problem in the past (even if it was barely tested)... @shoshijak Have you seen a similar problem on multi-gpu nodes?
Still, #205 is something else, unless you are using other GPU libraries in CP2K (currently only Sirius...)
Concerning your idea of peer-to-peer access rights for each GPU, yes, it should be doable by setting
export CUDA_VISIBLE_DEVICES=0 --> rank 0
export CUDA_VISIBLE_DEVICES=1 --> rank 1
Is this what you have in mind?
@alazzaro
Concerning your idea of peer-to-peer access rights for each GPU, yes, it should be doable by setting
export CUDA_VISIBLE_DEVICES=0 --> rank 0
export CUDA_VISIBLE_DEVICES=1 --> rank 1
Is this what you have in mind?
No, I mean using the cudaDeviceCanAccessPeer API to enable access rights to the other GPU's resources. You can refer to the link.
cudaSetDevice(currentDeviceId);
cudaDeviceCanAccessPeer(&isCanAccessPeer, currentDeviceId, peerDeviceId);
By the way, this is just a workaround.
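For reference, a sketch of this peer-to-peer workaround in CUDA C (illustrative, error handling omitted); note that cudaDeviceCanAccessPeer only queries the capability, while cudaDeviceEnablePeerAccess actually enables it:
#include <cuda_runtime.h>

/* Check whether currentDeviceId can access peerDeviceId's memory and, if so,
 * enable the access. cudaDeviceEnablePeerAccess takes the peer device and a
 * flags argument that must be 0; it may return cudaErrorPeerAccessAlreadyEnabled
 * if access was enabled earlier. */
static void enable_peer_access(int currentDeviceId, int peerDeviceId)
{
  int isCanAccessPeer = 0;
  cudaSetDevice(currentDeviceId);
  cudaDeviceCanAccessPeer(&isCanAccessPeer, currentDeviceId, peerDeviceId);
  if (isCanAccessPeer)
    cudaDeviceEnablePeerAccess(peerDeviceId, 0);
}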
@alazzaro Is there any plan for DBCSR to support multi-GPU in one node in the near future?
Sorry for chiming in a little late.
At this point, I wonder if this is something new... we never had a similar problem in the past (even if it was barely tested)... @shoshijak Have you seen a similar problem on multi-gpu nodes?
I've only worked on 1 multi-gpu node system (Summit), and there, we did what you suggested, namely assign 1 GPU to each MPI rank. To do so, we didn't rely on dbcsr_acc_set_active_device as is done at the top of dbcsr_unittest2. Rather, this was set directly in jsrun (the job launcher).
Although we did have issues on Summit, I believe we did not encounter this one in particular, but I cannot verify this.
I can work around this by adding peer-to-peer access rights for each GPU.
Adding Peer Device Memory Access to DBCSR seems to me like a lot of work, which would complicate DBCSR's code a lot, and I'm not sure about the returns.
@vitesselin I'm not sure where you are running your code: can you run via a job launcher (e.g. srun, aprun, jsrun), which would assign each GPU to an MPI rank? I'd be curious to know whether you still observe this bug there.
In any case, I'll try to see if I can reproduce this error. It seems like something is not quite working the way it should...
Adding Peer Device Memory Access to DBCSR seems to me like a lot of work, which would complicate DBCSR's code a lot, and I'm not sure about the returns.
IIRC, it should be rather straightforward to assign OpenMP threads to different GPUs.
I believe this will be an important feature in the near future, where we'll see HPC systems with fewer and bigger nodes. In other words, the interconnect between cores becomes increasingly non-uniform and we can no longer treat MPI and OpenMP on the same footing.
... I guess this topic merits an issue on its own.
@alazzaro Is there any plan for DBCSR to support multi-GPU in one node in the near future?
@vitesselin Assuming no other issues && assuming I can run on a multi-gpu node and reproduce your problem (I got access to a system, I have to try there), then I would say, based on what we discussed, I have a plan for how to fix the problem. I want to be conservative: 2 weeks for a fix.
In the meantime, I suggest using a single OpenMP thread and multiple MPI ranks per GPU (via MPS).
I think CUDA has no association between streams and devices. A single stream could direct work towards multiple devices. What about storing the device-ID within the stream and actually requiring a 1:1 relation between stream and device?
@hfp Yes, this is the idea... I must say, we don't need to save it per stream, we need to save it per rank and apply the setdevice any time we call a CUDA function... Clearly, we don't allow changing the rank<->device mapping during the execution...
I think it's convenient to store it on a per-stream basis; this also seals the change within the ACC backend.
Still, each stream within a rank can only access a single device, so all streams will have the same value... Does the ACC implementation allow streams with access to different GPUs?
so all streams will have the same value
There are not many streams used by CP2K/DBCSR (no big duplication, except one additional integer value for every stream). The alternative of storing it in DBCSR (and perhaps in CP2K in some places) would not save much and, worse, calling set_active_device would be scattered in both source code bases. If it's stored on a per-stream basis, every acc-backend function that takes a stream can call set_active_device with the device-ID given by the stream (and thereby seal this task inside the backend).
Does the ACC implementation allow streams with access to different GPUs?
I don't think so. I believe it's an implicitly assumed 1:1-relation, but we can ask Ole. The real problem is that streams may exist on a per-thread basis. The problem is not the data duplication but unnecessarily calling set_active_device in every thread. On the other hand, this is also part of the reported problem if I am not mistaken, and likely not worse compared to the current implementation.
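A minimal sketch of this per-stream idea in CUDA C (the struct and function names are hypothetical, not the actual ACC backend): the backend remembers the device a stream was created on and restores it before every operation that takes the stream.
#include <cuda_runtime.h>
#include <stddef.h>

typedef struct {
  cudaStream_t stream;
  int device;                /* device the stream was created on */
} acc_stream_sketch;

static int stream_create_sketch(acc_stream_sketch *s, int device, int priority)
{
  s->device = device;
  cudaSetDevice(device);     /* the stream is created on this device */
  return (int)cudaStreamCreateWithPriority(&s->stream, cudaStreamDefault, priority);
}

static int stream_memset_zero_sketch(const acc_stream_sketch *s,
                                     void *dev_ptr, size_t nbytes)
{
  cudaSetDevice(s->device);  /* seal the device selection inside the backend */
  return (int)cudaMemsetAsync(dev_ptr, 0, nbytes, s->stream);
}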
IIRC, the scheme goes as follows: Each thread owns a bunch of buffers. Each buffer is associated with a stream.
Back in the days GPUs only had around 8 hardware queues. One could create more streams but they would get assigned to the same hardware queue. Since we were relying on stream priorities, we were afraid to leave this assignment up to chance and decided to only create 8 streams and share them among the buffers. We never verified if this limitation of streams was actually necessary. All this logic lives in dbcsr_mm_accdrv_init.
For supporting multiple GPUs per MPI rank, I'd suggest to assign each thread and stream to a GPU. Then each thread has to call set_active_device() only once. Threads that get assigned to the same GPU could continue to share streams, but that might not actually be necessary.
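A rough sketch of the per-thread assignment suggested above, in CUDA C (illustrative, not DBCSR code): each OpenMP thread selects its GPU once, round-robin over the visible devices, since cudaSetDevice is a per-host-thread setting.
#include <cuda_runtime.h>
#include <omp.h>

static void bind_threads_to_gpus(void)
{
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  if (ndev <= 0) return;
#pragma omp parallel
  {
    /* each host thread has its own current device, so every thread must call cudaSetDevice */
    cudaSetDevice(omp_get_thread_num() % ndev);
  }
}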
Concerning the streams, what you did (@oschuett) is still OK. NVIDIA didn't change that much since then... There are still 2 copy engines/compute with PASCAL, probably more with VOLTA, but we still saturate the PCI bandwidth, so there is not much gain in optimizing it (I did it several times in the past).
From what I understand, the problem is that "whenever" you create a thread, you have to assign the GPU id. That means I have to understand all the places where we use threads for the GPU and add a setdevice. For instance, I can have:
OMP start assign ID ... threaded code GPU... OMP end
...bla bla code GPU...
OMP start assign ID ... threaded code GPU... OMP end
This is a bit risky; that's why I proposed in the first place to set the device before any GPU call.
@hfp Yes, the idea is to have the device ID within the dbcsr_configuration, then we can fetch the info...
@alazzaro that means you plan to open a parallel region (OpenMP) during initialization and call SetDevice for every thread. This must be true for farming as well (or any potential fork with its own copy of the OpenMP runtime). Anyhow, this would in general allow sharing a certain stream across devices. I am just saying that SYCL, for instance, adopted a different model, where a stream is associated with a single device. OpenMP/offload can deal with any of the models, we just have to decide which one we want.
I believe, OpenMP threads continue to exist across parallel regions. Hence, the threads should also remember their assigned active GPU across parallel regions.
@oschuett I'm far from being an expert; I'm just reporting what is written here:
https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
Quoting the bottom line:
To save yourself from a variety of multithreading bugs, remember: always call cudaSetDevice() first when you spawn a new host thread
@oschuett Now that you're rephrasing this, I believe it's not true in general. If the OpenMP implementation did not use a pool of threads (or gave up workers after a while), it would probably break.
@alazzaro You mentioned this before.
the way DBCSR uses multiple GPUs is via multiple ranks, where each rank is attached to a GPU in a round-robin fashion.
If I disable DBCSR-ACC, will CP2K support multi-GPU within one node or not?
Thanks, Vitesse.
No idea about CP2K... Probably @oschuett knows better?
In principle (never tried), you can drop GPU DBCSR by removing -D__DBCSR_ACC in the arch file.