
Volume error when running with CUDA and MPI

Open koparasy opened this issue 6 years ago • 4 comments

I am running LULESH on a single node with 160 CPUs and 4 GPUs (Tesla V100-SXM2). I am using openmpi-3.0.0 with CUDA 9.1. I execute the following command: mpirun -n 27 ./lulesh -s 60 and I get the following error: Rank 22: Volume Error in cell 211619 at iteration 14. The error appears at a different iteration on each execution. Any idea what is causing this error?

koparasy, Mar 21 '19 13:03

@koparasy do you always see the same cell as the problem, or does that change from run to run? Some of the CUDA versions have an unidentified race condition. Since no one was able to find it, I believe the fix was to synchronize after each kernel.
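To illustrate the "synchronize after each kernel" workaround mentioned above, here is a minimal sketch. It is not the actual NVIDIA fix and not code from LULESH: the macros `CUDA_CHECK` and `SYNC_AFTER_KERNEL`, the kernel `scale_volumes`, and the function `run_checked_step` are hypothetical names made up for this example. The idea is only that every launch is followed by an error check and a device-wide synchronization, so no kernel can start before the previous one has finished.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not part of LULESH): abort with a message on any CUDA error.
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                   cudaGetErrorString(err_), __FILE__, __LINE__);     \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// Hypothetical helper: place directly after every kernel launch.
// cudaGetLastError() reports launch failures; cudaDeviceSynchronize()
// blocks the host until all previously issued device work has completed.
#define SYNC_AFTER_KERNEL()              \
  do {                                   \
    CUDA_CHECK(cudaGetLastError());      \
    CUDA_CHECK(cudaDeviceSynchronize()); \
  } while (0)

// Stand-in kernel for illustration only; the real LULESH kernels differ.
__global__ void scale_volumes(double* vol, double factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) vol[i] *= factor;
}

// Runs two dependent kernels with a full synchronization after each one,
// then verifies the result on the host.
bool run_checked_step(int n) {
  std::vector<double> h_vol(n, 1.0);
  double* d_vol = nullptr;
  CUDA_CHECK(cudaMalloc((void**)&d_vol, n * sizeof(double)));
  CUDA_CHECK(cudaMemcpy(d_vol, h_vol.data(), n * sizeof(double),
                        cudaMemcpyHostToDevice));

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  scale_volumes<<<blocks, threads>>>(d_vol, 0.5, n);
  SYNC_AFTER_KERNEL();  // first kernel is finished before the second starts
  scale_volumes<<<blocks, threads>>>(d_vol, 0.5, n);
  SYNC_AFTER_KERNEL();

  CUDA_CHECK(cudaMemcpy(h_vol.data(), d_vol, n * sizeof(double),
                        cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(d_vol));

  // 1.0 * 0.5 * 0.5 is exactly representable, so exact comparison is safe here.
  for (double v : h_vol) {
    if (v != 0.25) return false;
  }
  return true;
}

int main() {
  std::printf("%s\n", run_checked_step(1 << 20) ? "ok" : "mismatch");
  return 0;
}
```

Synchronizing after every launch removes overlap between kernels, so some slowdown should be expected. For a quick experiment without code changes, setting the environment variable `CUDA_LAUNCH_BLOCKING=1` also makes kernel launches block the host, which may help confirm whether the failure is ordering-related.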

Note that this code was developed by NVIDIA and is not officially maintained. I will reach out to them to find out what the fix was and whether they can provide anything.

ikarlin, Mar 27 '19 21:03

@ikarlin, no, the cell ID and the iteration number both change between executions.

koparasy, Mar 28 '19 10:03

@koparasy thanks. I have confirmed with NVIDIA that this is the known race condition. We are discussing the best way to get the fix into the code. Is there a date you need this done by? That might influence our choice.

ikarlin, Mar 28 '19 13:03

I'm having the same issue. Has the race condition been fixed?

HenryYihengXu, Apr 16 '21 01:04