dbcsr
Hit acc_devmem_setzero: failed in dbcsr/acc/dbcsr_acc_devmem.F:445
Describe the bug
Hit acc_devmem_setzero: failed in dbcsr/acc/dbcsr_acc_devmem.F:445
To Reproduce
Steps to reproduce the behavior:
git clone --recursive https://github.com/cp2k/dbcsr.git (commit id: f35f901e4460980aa06757294463a1e6308f8dc9)
cd dbcsr; mkdir build; cd build
cmake -DTEST_MPI_RANKS=2 -DTEST_OMP_THREADS=2 -DUSE_CUDA=ON -DUSE_CUBLAS=ON -DWITH_GPU=P100 -DCMAKE_BUILD_TYPE=Debug ../
make -j16
OMP_NUM_THREADS=2 mpirun -np 2 ./tests/dbcsr_unittest2
test_name large_blocks_1
numthreads 2
numnodes 2
matrix_sizes 500 500 500
sparsities 0.50000000000000000 0.50000000000000000 0.50000000000000000
alpha (1.0000000000000000,0.0000000000000000)
beta (0.0000000000000000,0.0000000000000000)
limits 1 500 1 500 1 500
retain_sparsity F
bs_m 1 100
bs_n 1 100
bs_k 1 100
CUDA Error: invalid argument
*******************************************************************************
* ___ *
* / \ *
* [ABORT] *
* \___/ acc_devmem_setzero: failed *
* | *
* O/| *
* /| | *
* / \ dbcsr/acc/dbcsr_acc_devmem.F:445 *
*******************************************************************************
===== Routine Calling Stack =====
5 multiply_cannon
4 dbcsr_multiply_generic
3 test_multiply
2 dbcsr_test_multiplies
1 dbcsr_unittest
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
It looks like dbcsr allocates the memory on device 1 but launches the devmem operation with device 0 active.
Expected behavior
The CUDA fault should not occur.
Environment:
- Operating system & version: Ubuntu 18.04
- Compiler vendor & version: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
- Build environment (make or cmake): cmake
- Configuration of DBCSR (either the cmake flags or the Makefile.inc): cmake -DTEST_MPI_RANKS=2 -DTEST_OMP_THREADS=2 -DUSE_CUDA=ON -DUSE_CUBLAS=ON -DWITH_GPU=P100 -DCMAKE_BUILD_TYPE=Debug ../
- MPI implementation and version: openmpi 2.1.1
- If CUDA is being used: CUDA version and GPU architecture: cuda-10.1, P100
- BLAS/LAPACK implementation and version: cublas
- If applicable: Runtime information (how many nodes, type of nodes, ...): run in one node
Thanks, Vitesse.
Could you check by compiling with -DUSE_CUBLAS=OFF?
It seems to be an error due to https://github.com/cp2k/dbcsr/issues/229...
Dear Alazzaro,
- I cannot fix the issue by changing the code you mention in #229:
CALL acc_devmem_setzero_bytes(area%d%acc_devmem, s*l+1, s*u, area%d%memory_type%acc_stream)
- It still hits the failure when I disable cuBLAS:
cmake -DTEST_MPI_RANKS=2 -DTEST_OMP_THREADS=2 -DUSE_CUDA=ON -DUSE_CUBLAS=OFF -DWITH_GPU=P100 -DCMAKE_BUILD_TYPE=Debug ../
OMP_NUM_THREADS=2 mpirun -np 2 ./tests/dbcsr_unittest2
- The issue can be reproduced with version v2.0.0-alpha1.
Thanks, Vitesse.
OK, thanks for the clarification, we are trying to reproduce the problem. Concerning this part:
It looks like dbcsr allocates the memory on device 1 but launches the devmem operation with device 0 active.
Just to be sure, are you running on a multi-gpu system?
Yes, one node with 2 GPU cards.
Might be a duplicate of #205.
@vitesselin
Unfortunately, we don't have a multi-gpu system to try on. We would like to ask if you could try the following patch, which replaces the lines
https://github.com/cp2k/dbcsr/blob/f35f901e4460980aa06757294463a1e6308f8dc9/src/acc/dbcsr_acc_devmem.F#L435-L438
with
offset = 0
length = this%size_in_bytes
IF (PRESENT(first_byte)) THEN
   offset = first_byte - 1
   length = length - first_byte
ENDIF
IF (PRESENT(last_byte)) THEN
   length = last_byte
   IF (PRESENT(first_byte)) length = length - first_byte
ENDIF
and recompile DBCSR.
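For context, the offset and length computed above are presumably handed to a byte-wise memset on the device buffer. A minimal sketch of what such a call could look like, in CUDA C (the function name and error handling are illustrative and not DBCSR's actual backend; only cudaMemsetAsync and cudaGetErrorString are real CUDA API calls):
#include <cuda_runtime.h>
#include <stdio.h>

/* Hypothetical sketch: zero `length` bytes starting at byte `offset` of a device
 * allocation, asynchronously on `stream`. An "invalid argument" error here can
 * occur when the pointer or stream does not belong to the currently set device. */
static int devmem_setzero_sketch(void *dev_base, size_t offset, size_t length,
                                 cudaStream_t stream)
{
  char *ptr = (char *)dev_base + offset;
  cudaError_t err = cudaMemsetAsync(ptr, 0, length, stream);
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(err));
    return -1; /* would surface as "acc_devmem_setzero: failed" in the Fortran layer */
  }
  return 0;
}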
@alazzaro It still hits the problem and I also print out the length and offset for it.
> test_name large_blocks_1
> numthreads 2
> numnodes 2
> matrix_sizes 500 500 500
> sparsities 0.50000000000000000 0.50000000000000000 0.50000000000000000
> alpha (1.0000000000000000,0.0000000000000000)
> beta (0.0000000000000000,0.0000000000000000)
> limits 1 500 1 500 1 500
> retain_sparsity F
> bs_m 1 100
> bs_n 1 100
> bs_k 1 100
> length= 300000 offset= 0
> length= 432000 offset= 0
> CUDA Error: invalid argument
> length= 300000 offset= 0
> length= 432000 offset= 0
>
> *******************************************************************************
> * ___ *
> * / \ *
> * [ABORT] *
> * \___/ acc_devmem_setzero: failed *
> * | *
> * O/| *
> * /| | *
> * / \ dbcsr/acc/dbcsr_acc_devmem.F:452 *
> *******************************************************************************
>
>
> ===== Routine Calling Stack =====
>
> 5 multiply_cannon
> 4 dbcsr_multiply_generic
> 3 test_multiply
> 2 dbcsr_test_multiplies
> 1 dbcsr_unittest
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>
@oschuett Yes, it looks similar to #205. I observed that DBCSR only does launchkernel on the first device even if I have two or more. Furthermore, DBCSR allocates device memory on each device, not only on the first one. The issue happens when GPU0 wants to use memory that was allocated on GPU1,2,3... If this is the scenario DBCSR intends, DBCSR should enable peer-to-peer access rights between the GPUs.
One more question: what configuration do CP2K and DBCSR use for the official build testing? A single node or multiple nodes? A single GPU per node or multiple GPUs per node?
I also print the current device whenever DBCSR allocates device memory. Configuration: one node with 2 GPU cards, OMP=2, MPI=2, so we have two processes and four threads. Here is which device each allocation goes to (the columns are pid, tid, and the current device):
44734 44734 D VIT myDevice=0
44734 44742 D VIT myDevice=0
44735 44735 D VIT myDevice=0
44735 44744 D VIT myDevice=1
The related code.
diff --git a/src/acc/cuda/acc_cuda_mem.cu b/src/acc/cuda/acc_cuda_mem.cu
index 9db76ae3..471e4383 100644
--- a/src/acc/cuda/acc_cuda_mem.cu
+++ b/src/acc/cuda/acc_cuda_mem.cu
@@ -12,6 +12,8 @@
#include <math.h>
#include "acc_cuda_error.h"
#include "../include/acc.h"
+#include <unistd.h>
+#include <sys/syscall.h>
static const int verbose_print = 0;
@@ -19,6 +21,10 @@ static const int verbose_print = 0;
/****************************************************************************/
extern "C" int acc_dev_mem_allocate(void **dev_mem, size_t n){
cudaError_t cErr;
+ int myDevice;
+
+ cErr = cudaGetDevice (&myDevice);
+ printf ("%d %d D VIT myDevice=%d\n", getpid(), syscall(SYS_gettid), myDevice);
A few clarifications:
- we don't test multi-gpu configurations (see point 3 too)
- I don't think #205 is related to this bug in the first place. #205 is about using the library in combination with other libraries that use the GPUs, for instance in CP2K when it uses Sirius. In your case, you are running the library alone with a unit test
- the way DBCSR uses multiple GPUs is via multiple ranks, where each rank is attached to a GPU in a round-robin fashion (see the sketch below). So, in your case, you have 2 ranks: rank0<->gpu0 and rank1<->gpu1. For the particular unittest2, the code which does this is
https://github.com/cp2k/dbcsr/blob/f35f901e4460980aa06757294463a1e6308f8dc9/tests/dbcsr_unittest2.F#L69-L70
This is done at the beginning of the execution. There are no other calls in DBCSR where we change the device. So, unless we lose the device for that rank, the GPU-to-rank matching should be preserved... But now you are showing that:
44734 44734 D VIT myDevice=0
44734 44742 D VIT myDevice=0
44735 44735 D VIT myDevice=0
44735 44744 D VIT myDevice=1
This is definitely wrong, one of the ranks (44735) is trying to access both devices! I wonder how it can happen... I understand we need to reset the GPU before each multiplication, but it is still not clear to me how and where things are getting wrong in unittest2...
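To make the round-robin rank<->GPU mapping above concrete, here is a minimal sketch in CUDA C (illustrative only, not the actual Fortran in dbcsr_unittest2.F; the plain MPI_COMM_WORLD rank is used since the report runs on a single node, and MPI is assumed to be initialized already):
#include <cuda_runtime.h>
#include <mpi.h>

/* Each MPI rank picks one device at startup (rank0<->gpu0, rank1<->gpu1, ...)
 * and is expected to keep it for the rest of the run. Note that cudaSetDevice
 * only affects the calling host thread; threads spawned later do not inherit it. */
static void bind_rank_to_gpu(void)
{
  int rank = 0, ndev = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaGetDeviceCount(&ndev);
  if (ndev > 0)
    cudaSetDevice(rank % ndev);
}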
I think I got the answer to the problem. Sorry for my ignorance of CUDA threading:
https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
@vitesselin I assume that with 1 thread the code works fine?
@alazzaro
I assume that with 1 thread the code works fine?
Yes, the issue is always found in process 2 with thread id 1. For the CUDA threading problem, I have no idea where I should add the SetDevice call in the code.
By the way, I also tried this in CP2K with 2 GPUs in one node. There are a lot of similar errors in the regression tests, so I think it is similar to #205. I can work around this by adding peer-to-peer access rights for each GPU.
Given that it is suggested to set the GPU even per thread (my understanding was that each thread should keep the same GPU as its rank; it turns out I was wrong), I think we need to set the GPU before any CUDA call... At this point, I wonder if this is something new... we never had a similar problem in the past (even if it was barely tested)... @shoshijak Have you seen a similar problem on multi-gpu nodes?
Still, #205 is something else, unless you are using other GPU libraries in CP2K (currently only Sirius...)
Concerning your idea of peer-to-peer access rights for each GPU, yes, it should be doable by setting
export CUDA_VISIBLE_DEVICES=0 --> rank 0
export CUDA_VISIBLE_DEVICES=1 --> rank 1
Is this what you have in mind?
@alazzaro
Concerning your idea of peer-to-peer access rights for each GPU, yes, it should be doable by setting
export CUDA_VISIBLE_DEVICES=0 --> rank 0
export CUDA_VISIBLE_DEVICES=1 --> rank 1
Is this what you have in mind?
No, I mean using the cudaDeviceCanAccessPeer API to enable access rights to the other GPU's resources. You can refer to the link.
cudaSetDevice(currentDeviceId);
cudaDeviceCanAccessPeer(&isCanAccessPeer, currentDeviceId, peerDeviceId);
By the way, this is just a workaround.
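For reference, a sketch of this peer-to-peer workaround in CUDA C (illustrative, error handling omitted); note that cudaDeviceCanAccessPeer only queries the capability, while cudaDeviceEnablePeerAccess actually enables it:
#include <cuda_runtime.h>

/* Check whether currentDeviceId can access peerDeviceId's memory and, if so,
 * enable the access. cudaDeviceEnablePeerAccess takes the peer device and a
 * flags argument that must be 0; it may return cudaErrorPeerAccessAlreadyEnabled
 * if access was enabled earlier. */
static void enable_peer_access(int currentDeviceId, int peerDeviceId)
{
  int isCanAccessPeer = 0;
  cudaSetDevice(currentDeviceId);
  cudaDeviceCanAccessPeer(&isCanAccessPeer, currentDeviceId, peerDeviceId);
  if (isCanAccessPeer)
    cudaDeviceEnablePeerAccess(peerDeviceId, 0);
}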
@alazzaro Is there any plan for DBCSR to support multi-GPU in one node in the near future?
Sorry for chiming in a little late.
At this point, I wonder if this is something new... we never had a similar problem in the past (even if it was barely tested)... @shoshijak Have you seen a similar problem on multi-gpu nodes?
I've only worked on 1 multi-gpu node system (Summit), and there, we did what you suggested, namely assign 1 GPU to each MPI rank. To do so, we didn't rely on dbcsr_acc_set_active_device as is done at the top of dbcsr_unittest2. Rather, this was set directly in jsrun (the job launcher).
Although we did have issues on Summit, I believe we did not encounter this one in particular, but I cannot verify this.
I can work around this by adding peer-to-peer access rights for each GPU.
Adding Peer Device Memory Access to DBCSR seems to me like a lot of work, which would complicate DBCSR's code a lot, and I'm not sure about the returns.
@vitesselin I'm not sure where you are running your code: can you run via a job launcher (e.g. srun, aprun, jsrun), which would assign each GPU to an MPI rank? I'd be curious to know whether you still observe this bug there.
In any case, I'll try to see if I can reproduce this error. It seems like something is not quite working the way it should...
Adding Peer Device Memory Access to DBCSR seems to me like a lot of work, which would complicate DBCSR's code a lot, and I'm not sure about the returns.
IIRC, it should be rather straightforward to assign OpenMP threads to different GPUs.
I believe this will be an important feature in the near future, where we'll see HPC systems with fewer and bigger nodes. In other words, the interconnect between cores becomes increasingly non-uniform and we can no longer treat MPI and OpenMP on the same footing.
... I guess this topic merits an issue on its own.
@alazzaro Is there any plan for DBCSR to support multi-GPU in one node in the near future?
@vitesselin Assuming no other issues && assuming I can run on a multi-gpu node and reproduce your problem (I got access to a system, I have to try there), then I would say, based on what we discussed, I have a plan for how to fix the problem. I want to be conservative: 2 weeks for a fix.
In the meantime, I suggest using a single OpenMP thread and multiple MPI ranks per GPU (via MPS).
I think CUDA has no association between streams and devices. A single stream could direct work towards multiple devices. What about storing the device-ID within the stream and actually requiring a 1:1 relation between stream and device?
@hfp Yes, this is the idea... I must say, we don't need to save it per stream, we need to save it per rank and apply the setdevice any time we call a CUDA function... Clearly, we don't allow changing the rank<->device mapping during the execution...
I think it's convenient to store it on a per-stream basis; this also seals the change within the ACC backend.
Still, each stream within a rank can only access a single device, so all streams will have the same value... Does the ACC implementation allow streams with access to different GPUs?
so all streams will have the same value
There are not many streams used by CP2K/DBCSR (no big duplication, except one additional integer value for every stream). The alternative of storing it in DBCSR (and perhaps in CP2K in some places) would not save much and, worse, calling set_active_device would be scattered in both source code bases. If it's stored on a per-stream basis, every acc-backend function that takes a stream can call set_active_device with the device-ID given by the stream (and thereby seal this task inside the backend).
Does the ACC implementation allow streams with access to different GPUs?
I don't think so. I believe it's an implicitly assumed 1:1-relation, but we can ask Ole. The real problem is that streams may exist on a per-thread basis. The problem is not the data duplication but unnecessarily calling set_active_device in every thread. On the other hand, this is also part of the reported problem if I am not mistaken, and likely not worse compared to the current implementation.
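A minimal sketch of this per-stream idea in CUDA C (the struct and function names are hypothetical, not the actual ACC backend): the backend remembers the device a stream was created on and restores it before every operation that takes the stream.
#include <cuda_runtime.h>
#include <stddef.h>

typedef struct {
  cudaStream_t stream;
  int device;                /* device the stream was created on */
} acc_stream_sketch;

static int stream_create_sketch(acc_stream_sketch *s, int device, int priority)
{
  s->device = device;
  cudaSetDevice(device);     /* the stream is created on this device */
  return (int)cudaStreamCreateWithPriority(&s->stream, cudaStreamDefault, priority);
}

static int stream_memset_zero_sketch(const acc_stream_sketch *s,
                                     void *dev_ptr, size_t nbytes)
{
  cudaSetDevice(s->device);  /* seal the device selection inside the backend */
  return (int)cudaMemsetAsync(dev_ptr, 0, nbytes, s->stream);
}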
IIRC, the scheme goes as follows: Each thread owns a bunch of buffers. Each buffer is associated with a stream.
Back in the days GPUs only had around 8 hardware queues. One could create more streams but they would get assigned to the same hardware queue. Since we were relying on stream priorities, we were afraid to leave this assignment up to chance and decided to only create 8 streams and share them among the buffers. We never verified if this limitation of streams was actually necessary. All this logic lives in dbcsr_mm_accdrv_init.
For supporting multiple GPUs per MPI rank, I'd suggest to assign each thread and stream to a GPU. Then each thread has to call set_active_device() only once. Threads that get assigned to the same GPU could continue to share streams, but that might not actually be necessary.
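A rough sketch of the per-thread assignment suggested above, in CUDA C (illustrative, not DBCSR code): each OpenMP thread selects its GPU once, round-robin over the visible devices, since cudaSetDevice is a per-host-thread setting.
#include <cuda_runtime.h>
#include <omp.h>

static void bind_threads_to_gpus(void)
{
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  if (ndev <= 0) return;
#pragma omp parallel
  {
    /* each host thread has its own current device, so every thread must call cudaSetDevice */
    cudaSetDevice(omp_get_thread_num() % ndev);
  }
}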
Concerning the streams, what you did (@oschuett) is still OK. NVIDIA didn't change that much since then... There are still 2 copy engines/compute with PASCAL, probably more with VOLTA, but we still saturate the PCI bandwidth, so there is not much gain in optimizing it (I did it several times in the past).
From what I understand, the problem is that "whenever" you create a thread, you have to assign the GPU id. That means I have to understand all the places where we use threads for the GPU and add a setdevice. For instance, I can have:
OMP start assign ID ... threaded code GPU... OMP end
...bla bla code GPU...
OMP start assign ID ... threaded code GPU... OMP end
This is a bit risky; that's why I proposed in the first place to set the device before any GPU call.
@hfp Yes, the idea is to have the device ID within the dbcsr_configuration, then we can fetch the info...
@alazzaro that means you plan to open a parallel region (OpenMP) during initialization and call SetDevice for every thread. This must be true for farming as well (or any potential fork with its own copy of the OpenMP runtime). Anyhow, this would in general allow sharing a certain stream across devices. I am just saying that SYCL, for instance, adopted a different model, where a stream is associated with a single device. OpenMP/offload can deal with any of the models, we just have to decide which one we want.
I believe, OpenMP threads continue to exist across parallel regions. Hence, the threads should also remember their assigned active GPU across parallel regions.
@oschuett I'm far from being an expert; I'm just reporting what is written here:
https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
Quoting the bottom line:
To save yourself from a variety of multithreading bugs, remember: always call cudaSetDevice() first when you spawn a new host thread
@oschuett Now that you're rephrasing this, I believe it's not true in general. If the OpenMP implementation did not use a pool of threads (or gave up workers after a while), it would probably break.
@alazzaro You mentioned this before.
the way DBCSR uses multiple GPUs is via multiple ranks, where each rank is attached to a GPU in a round-robin fashion.
If I disable DBCSR-ACC, will CP2K support multi-GPU within one node or not?
Thanks, Vitesse.
No idea about CP2K... Probably @oschuett knows better?
In principle (never tried), you can drop GPU DBCSR by removing -D__DBCSR_ACC in the arch file.