[Gromacs 2020.2] Issue when trying to run SAXS on GPU
Hello,
One of our users ran into some issues when trying to run Gromacs 2020.2 patched with Plumed 2.6.1 while configuring SAXS to run on GPU (using the ArrayFire library).
You can find a small test case here: https://filesender.renater.fr/?s=download&token=f3a932e9-e74e-4ec4-b0ed-d768551aec5b (link valid until December 23rd).
The error is as follows:
Program: gmx mdrun, version 2020.2-MODIFIED
Source file: src/gromacs/gpu_utils/devicebuffer.cuh (line 207)
Function: clearDeviceBufferAsync(ValueType**, size_t, size_t, CommandStream) [with ValueType = float; DeviceBuffer<ValueType> = float*; size_t = long unsigned int; CommandStream = CUstream_st*]::<lambda()>
MPI rank: 1 (out of 4)
Assertion failed:
Condition: stat == cudaSuccess
Couldn't clear the device buffer
The full log is attached: output.log (https://github.com/plumed/plumed2/files/5585224/output.log).
There is no error if SAXS runs on CPU (i.e. when removing the GPU keyword from the Rep*/plumed-saxsCG.dat files).
There is no error if I set the CUDA_VISIBLE_DEVICES environment variable so that each of the 4 MPI tasks only sees one distinct GPU (note that in this case I have to modify the Rep*/plumed-saxsCG.dat files so that DEVICEID=0 is used everywhere).
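For reference, one common way to set this up is a small per-rank wrapper script; the sketch below is illustrative and assumes Open MPI's OMPI_COMM_WORLD_LOCAL_RANK variable (other launchers and schedulers expose a different variable):
#!/bin/bash
# Illustrative wrapper, launched as: mpirun -np 4 ./bind_gpu.sh gmx_mpi mdrun ...
# Each rank then sees only one GPU, so DEVICEID=0 works in every Rep*/plumed-saxsCG.dat.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"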
I hope that you can shed some light on this issue. Let me know if you need more information.
Best regards, Rémi
Dear Rémi
I don't know for sure, but my guess is that if all the replicas run on the same node, they still all try to use the same GPU (there isn't an automatic distribution of the GPUs at the moment, but it should be possible to learn from GROMACS which GPU is used and use the same one; I have noted this down for future development). An alternative to using CUDA_VISIBLE_DEVICES is to set a different DEVICEID for each replica using the replica syntax in plumed (I am not sure, but it should work), something like
DEVICEID=@replicas:{0 1 2 3}
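For illustration, a minimal sketch of how this could look inside each Rep*/plumed-saxsCG.dat (the atom selection and scattering keywords below are placeholders, not your actual input):
SAXS ...
  LABEL=saxs
  ATOMS=1-100                      # placeholder atom selection
  GPU                              # run the SAXS calculation on the GPU
  DEVICEID=@replicas:{0 1 2 3}     # replica 0 -> GPU 0, replica 1 -> GPU 1, ...
  QVALUE1=0.05 EXPINT1=1.0         # placeholder scattering data
... SAXS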
Best
Carlo
Dear Carlo,
Thanks for your answer!
my guess is that if all the replicas run on the same node, they still all try to use the same GPU
This is true by default but I made sure to set DEVICEID to a different value in the config file of each replica.
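Concretely, each replica directory just pins a different device, roughly like this (sketch, with the other SAXS keywords omitted):
# Rep0/plumed-saxsCG.dat: ... GPU DEVICEID=0 ...
# Rep1/plumed-saxsCG.dat: ... GPU DEVICEID=1 ...
# Rep2/plumed-saxsCG.dat: ... GPU DEVICEID=2 ...
# Rep3/plumed-saxsCG.dat: ... GPU DEVICEID=3 ...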
As far as I can tell this is properly passed to ArrayFire since I can see the following in the output log:
...
ArrayFire v3.7.2 (CUDA, 64-bit Linux, build 218dd2c)
Platform: CUDA Runtime 10.2, Driver: 440.64.00
[0] Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-1- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-2- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-3- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
...
ArrayFire v3.7.2 (CUDA, 64-bit Linux, build 218dd2c)
Platform: CUDA Runtime 10.2, Driver: 440.64.00
-0- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-1- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
[2] Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
-3- Tesla V100-SXM2-16GB, 16161 MB, CUDA Compute 7.0
...
An alternative to using CUDA_VISIBLE_DEVICES is to set a different DEVICEID for each replica using the replica syntax in plumed (I am not sure, but it should work), something like DEVICEID=@replicas:{0 1 2 3}
I can try that, but as far as I understand this is just another way of using a different device ID for each replica.
Best, Rémi
OK, so another possibility I can think of is that the GPU assigned to a replica by plumed does not correspond to the GPU used by the corresponding GROMACS replica, but if you set them by hand this is probably not the case.
I don't know… I would need to try to reproduce the bug locally. In any case, I am glad there is a workaround, because I don't know when I will have the time to do that.
Best, Carlo
OK, so another possibility I can think of is that the GPU assigned to a replica by plumed does not correspond to the GPU used by the corresponding GROMACS replica, but if you set them by hand this is probably not the case.
That's a good point actually! Gromacs might not use the same rank-to-GPU binding as the one I set manually for the SAXS GPU. I will double-check that.
Gromacs might not use the same rank-to-GPU binding as the one I set manually for the SAXS GPU
That's correct, Gromacs seems to be using some weird binding I don't really understand:
- rank 0 --> GPU 2
- rank 1 --> GPU 3
- rank 2 --> GPU 1
- rank 3 --> GPU 0
That explains the issue.
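If that mapping turned out to be stable, it could in principle be mirrored with the replica syntax Carlo suggested, e.g. (hypothetical, and fragile if Gromacs ever changes its assignment):
DEVICEID=@replicas:{2 3 1 0}   # rank 0 -> GPU 2, rank 1 -> GPU 3, rank 2 -> GPU 1, rank 3 -> GPU 0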
So I think using the same GPU as Gromacs by default would be a good idea.
@RemiLacroix-IDRIS Thanks for reporting this; I will try to implement it.
Fixed in #893
@carlocamilloni: If I understand correctly, this means that if DEVICEID is not set, Plumed will use the same GPU as Gromacs by default?
@RemiLacroix-IDRIS Yes, this is how it works (we have tested it only on Marconi100 at CINECA, which has 4 GPUs per node). The only issue is that, to compile it, you need to point the include path to the CUDA include folder (I have not yet modified the configure script to do this automatically). Also, GROMACS needs to be re-patched.
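As an illustration only (the exact configure options and paths depend on your installation, and this is not the final documented procedure), pointing the PLUMED build at the CUDA headers and re-patching GROMACS could look something like:
# Hypothetical sketch; /usr/local/cuda, the job width, and the GROMACS path are placeholders.
# Keep whatever ArrayFire-related options were used for the original build.
./configure --enable-mpi CPPFLAGS="-I/usr/local/cuda/include" LDFLAGS="-L/usr/local/cuda/lib64"
make -j 8 && make install
# Then re-apply the PLUMED patch in the GROMACS source tree (pick the matching engine when prompted) and rebuild GROMACS:
cd /path/to/gromacs-2020.2
plumed patch -p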