illegal memory access was encountered

Open koparasy opened this issue 6 years ago • 13 comments

Hello, I compile GPUSPH with "make", then run "make ProblemExample". When I then execute ./GPUSPH, I get the following error:

Device 0 thread 140735808663952 iteration 0 last command: 7. Exception: src/cuda/forces.cu(516) : in unbind_textures() @ thread 0x140735808663952 : cudaSafeCall() runtime API error 77 : an illegal memory access was encountered

The same error also occurs when running other problems.

My system information is the following: g++: (GCC) 6.4.0; nvcc: release 9.1, V9.1.85; GPU devices: 4 x Tesla V100-SXM2.

koparasy avatar Mar 22 '19 12:03 koparasy

Hello @koparasy

which GPUSPH version (or branch) does this happen with?

Oblomov avatar Mar 23 '19 09:03 Oblomov

Hello @Oblomov

I am working on the main branch.

koparasy avatar Mar 25 '19 16:03 koparasy

Hello @koparasy

can you please try the next branch and see if it's fixed there already?

Oblomov avatar Mar 26 '19 08:03 Oblomov

I checked out the next branch and the error persists.

koparasy avatar Mar 26 '19 14:03 koparasy

I'm afraid I'm unable to reproduce the issue locally. Can you see if running GPUSPH under cuda-memcheck gives some indication of where the illegal access is coming from?

Oblomov avatar Mar 27 '19 10:03 Oblomov

Also: are you running single- or multi-GPU?

Oblomov avatar Mar 27 '19 10:03 Oblomov

The error appears on both single- and multi-GPU runs. I am attaching the output of a run on a single GPU, built without MPI support and without HDF5 support.

Attached is the output of cuda-memcheck: cudaMemCheck.txt

koparasy avatar Mar 28 '19 10:03 koparasy

Thanks for the report. From the log, it would seem that the issue happens when the forcesDevice kernel tries to fetch a neighbor's position, but the array it's trying to read from should be valid. I do not have any GPU with Compute Capability 7.0 and cannot reproduce the error on my machine, so I'm afraid debugging will be a bit slow and you'll have to be my hands and eyes 8-)

For starters, I would recommend updating to the latest next that I just pushed, which includes a small fix for neighbors traversal. I don't think it's directly relevant to this case, but you never know.

If the latest next (currently at commit add5af07) doesn't fix the issue, I would ask you to try the following change: in src/cuda/textures.cuh, replace the line:

#if __COMPUTE__ >= 20 && __COMPUTE__/10 != 3

with

#if 1

and see if you can replicate the error, and then again replacing it with

#if 0

and see if you can replicate the error. This should help us pinpoint a bit better the possible source of error.

Oblomov avatar Mar 28 '19 14:03 Oblomov

I had the same issue on a compute capability 6.1 CUDA device. I am actually working on the wsl branch, but I did merge that fix from the next branch. I also tried both changes in textures.cuh; neither worked.

anthropoy avatar Jul 26 '19 04:07 anthropoy

Can you please provide the output of make show? It should be available as info/show.txt ready for export if you're on a recent enough branch.

Oblomov avatar Jul 26 '19 06:07 Oblomov

show.txt

As attached, thanks.

anthropoy avatar Jul 29 '19 07:07 anthropoy

The Microsoft compiler suffers from this bug, which affects GPUSPH. A large part of the changes introduced in the wsl branch are specifically to work around this, but it seems you're hitting a case we missed. We'll look into it.

Oblomov avatar Jul 29 '19 07:07 Oblomov

I see, thanks for looking into this.

anthropoy avatar Jul 29 '19 08:07 anthropoy