
[Gromacs 2020.2] Issue when trying to run SAXS on GPU

Open RemiLacroix-IDRIS opened this issue 4 years ago • 6 comments

Hello,

One of our users ran into some issues when trying to run Gromacs 2020.2 patched with Plumed 2.6.1 while configuring SAXS to run on GPU (using the ArrayFire library).

You can find a small test case here: https://filesender.renater.fr/?s=download&token=f3a932e9-e74e-4ec4-b0ed-d768551aec5b (link valid until December 23rd).

The error is as follows:

Program:     gmx mdrun, version 2020.2-MODIFIED
Source file: src/gromacs/gpu_utils/devicebuffer.cuh (line 207)
Function:    clearDeviceBufferAsync(ValueType**, size_t, size_t, CommandStream) [with ValueType = float; DeviceBuffer<ValueType> = float*; size_t = long unsigned int; CommandStream = CUstream_st*]::<lambda()>
MPI rank:    1 (out of 4)

Assertion failed:
Condition: stat == cudaSuccess
Couldn't clear the device buffer

The full log is attached: output.log.

There is no error if SAXS runs on CPU (i.e. when removing the GPU keyword from Rep*/plumed-saxsCG.dat files).

There is no error if I set the CUDA_VISIBLE_DEVICES environment variable so that each of the 4 MPI tasks sees only one distinct GPU (note that in this case I have to modify the Rep*/plumed-saxsCG.dat files so that DEVICEID=0 is used everywhere).
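For illustration only, the relevant part of each Rep*/plumed-saxsCG.dat presumably looks something like the sketch below (the atom selection and q-values are placeholders, not the actual inputs; GPU and DEVICEID are the keywords discussed here):

SAXS ...
  ATOMS=1-210                  # placeholder selection, not the real one
  QVALUE1=0.05 QVALUE2=0.10    # placeholder q-values
  GPU                          # run the SAXS calculation on the GPU (via ArrayFire)
  DEVICEID=0                   # Rep0 uses 0; Rep1, Rep2, Rep3 use 1, 2, 3
... SAXS

With the CUDA_VISIBLE_DEVICES workaround each rank only sees one GPU, so every file then uses DEVICEID=0.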

I hope that you can shed some light on this issue. Let me know if you need more information.

Best regards, Rémi

RemiLacroix-IDRIS avatar Nov 23 '20 19:11 RemiLacroix-IDRIS

Dear Rémi

I don’t know, but my guess is that if all the replicas run on the same node, they still all try to use the same GPU (there is no automatic distribution of the GPUs at the moment, but it should be possible to learn from GROMACS which GPU is used and use the same one; I have noted this down for future development). An alternative to using CUDA_VISIBLE_DEVICES is to set a different DEVICEID for each replica using the replica syntax in PLUMED (I am not sure, but it should work), something like

DEVICEID=@replicas:{0 1 2 3}
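(For illustration, in context this might look like the sketch below, keeping the rest of the SAXS action unchanged; this is only a guess at how the existing input would be modified:)

SAXS ...
  # other SAXS keywords as in the existing Rep*/plumed-saxsCG.dat
  GPU
  DEVICEID=@replicas:{0 1 2 3}   # replica 0 -> device 0, replica 1 -> device 1, ...
... SAXS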

Best

Carlo


carlocamilloni avatar Nov 24 '20 08:11 carlocamilloni

Dear Carlo,

Thanks for your answer!

my guess is that if all the replicas run on the same node, they still all try to use the same GPU

This is true by default but I made sure to set DEVICEID to a different value in the config file of each replica.

As far as I can tell this is properly passed to ArrayFire since I can see the following in the output log:

...
ArrayFire v3.7.2 (CUDA, 64-bit Linux, build 218dd2c)
Platform: CUDA Runtime 10.2, Driver: 440.64.00
[0] Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-1- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-2- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-3- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
...
ArrayFire v3.7.2 (CUDA, 64-bit Linux, build 218dd2c)
Platform: CUDA Runtime 10.2, Driver: 440.64.00
-0- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-1- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
[2] Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-3- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
...

An alternative to using CUDA_VISIBLE_DEVICES is to set a different DEVICEID for each replica using the replica syntax in PLUMED (I am not sure, but it should work), something like DEVICEID=@replicas:{0 1 2 3}

I can try that, but as far as I understand this is just another way of using a different device id for each replica.

Best, Rémi

RemiLacroix-IDRIS avatar Nov 24 '20 09:11 RemiLacroix-IDRIS

OK, so another thing I can think of is that maybe the GPU assigned to a replica by PLUMED does not correspond to the GPU used by the corresponding GROMACS replica, but if you set them by hand this is probably not the case.

I don’t know… I would need to try to reproduce the bug locally. Anyway, I am happy there is a workaround, because I don’t know when I will have the time to do it.

Best, Carlo


carlocamilloni avatar Nov 24 '20 09:11 carlocamilloni

OK, so another thing I can think of is that maybe the GPU assigned to a replica by PLUMED does not correspond to the GPU used by the corresponding GROMACS replica, but if you set them by hand this is probably not the case.

That's a good point actually! Gromacs might not have used the same binding logic as the one I set manually for the SAXS GPU. I will double-check that.

RemiLacroix-IDRIS avatar Nov 24 '20 09:11 RemiLacroix-IDRIS

Gromacs might not have used the same binding logic as the one I set manually for the SAXS GPU

That's correct, Gromacs seems to be using some weird binding I don't really understand:

  • rank 0 --> GPU 2
  • rank 1 --> GPU 3
  • rank 2 --> GPU 1
  • rank 3 --> GPU 0

That explains the issue.

So I think using by default the same GPU as Gromacs would be a good idea.
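For reference, one way to match the mapping above by hand would be the replica syntax Carlo suggested, with the device ids reordered accordingly (values taken from the mapping observed above):

DEVICEID=@replicas:{2 3 1 0}   # rank 0 -> GPU 2, rank 1 -> GPU 3, rank 2 -> GPU 1, rank 3 -> GPU 0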

RemiLacroix-IDRIS avatar Nov 26 '20 18:11 RemiLacroix-IDRIS

@RemiLacroix-IDRIS Thanks for reporting this, I will try to implement it

carlocamilloni avatar Nov 26 '20 21:11 carlocamilloni

Fixed in #893

carlocamilloni avatar Jan 31 '23 20:01 carlocamilloni

@carlocamilloni : If I understand correctly, this means that if DEVICEID is not set, Plumed will use the same GPU as Gromacs by default?

RemiLacroix-IDRIS avatar Feb 01 '23 09:02 RemiLacroix-IDRIS

@RemiLacroix-IDRIS yes, this is how it works (we have tested it only on Marconi100 at CINECA, where there are 4 GPUs per node). The only issue is that to compile it you need to add the cuda/include folder to the include path (I have not yet modified the configure script to do it automatically). Also, GROMACS needs to be re-patched.
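In practice this means the input can simply omit DEVICEID, something like the sketch below (just a sketch; the rest of the SAXS action stays as it is):

SAXS ...
  # other SAXS keywords as before
  GPU    # no DEVICEID set: PLUMED uses the GPU already assigned to this replica by GROMACS
... SAXS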

carlocamilloni avatar Feb 07 '23 08:02 carlocamilloni