illegal memory access was encountered
Hello, I compile GPUSPH with "make" and then run "make ProblemExample". When I then execute ./GPUSPH, I get the following error:
Device 0 thread 140735808663952 iteration 0 last command: 7. Exception: src/cuda/forces.cu(516) : in unbind_textures() @ thread 0x140735808663952 : cudaSafeCall() runtime API error 77 : an illegal memory access was encountered
The same error also occurs when running other problems.
My system information is the following:
g++: (GCC) 6.4.0
nvcc: release 9.1, V9.1.85
GPU devices: 4 x Tesla V100-SXM2
Hello @koparasy
which GPUSPH version (or branch) does this happen with?
Hello @Oblomov
I am working on the main branch.
Hello @koparasy
can you please try the next branch and see if it's fixed there already?
I checked out the next branch and the error persists.
I'm afraid I'm unable to reproduce the issue locally. Can you see if running GPUSPH under cuda-memcheck gives some indication of where the illegal access is coming from?
Also: are you running single- or multi-GPU?
The error appears in both single- and multi-GPU runs. I am attaching the output of a run on a single GPU, without MPI support and without HDF5 support.
I am also attaching the output of cuda-memcheck: cudaMemCheck.txt
Thanks for the report. From the log, it would seem that the issue happens when the forcesDevice kernel tries to fetch the neighbors position, but the array where it's trying to read from should be valid. I do not have any GPU with Compute Capability 7.0 and cannot reproduce the error on my machine, so I'm afraid debugging will be a bit slow and you'll have to be my hands and eyes 8-)
For starters, I would recommend updating to the latest next that I just pushed, which includes a small fix for neighbors traversal. I don't think it's directly relevant to this case, but you never know.
If the latest next (currently at commit add5af07) doesn't fix the issue, I would ask you to try the following change: in src/cuda/textures.cuh, replace the line:
#if __COMPUTE__ >= 20 && __COMPUTE__/10 != 3
with
#if 1
and see if you can replicate the error, and then again replacing it with
#if 0
and see if you can replicate the error. This should help us pinpoint a bit better the possible source of error.
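For reference, here is a minimal sketch of what the original guard evaluates to on a few targets. It assumes that `__COMPUTE__` holds the compute capability multiplied by 10 (so 70 for the Tesla V100), which is what the `/10` in the condition suggests but is not spelled out in this thread. The function name `guard_is_true` is hypothetical and is not part of GPUSPH; it only mirrors the preprocessor expression so it can be checked at compile time.

```cpp
// Hypothetical helper (not part of GPUSPH): mirrors the preprocessor guard
//   #if __COMPUTE__ >= 20 && __COMPUTE__/10 != 3
// as a constexpr function.
// Assumption: `compute` is the compute capability times 10 (7.0 -> 70).
constexpr bool guard_is_true(int compute)
{
    return compute >= 20 && compute / 10 != 3;
}

// Compile-time checks for a few representative targets.
static_assert(!guard_is_true(13), "1.3: below 20, guard is false");
static_assert(guard_is_true(20),  "2.0: guard is true");
static_assert(!guard_is_true(35), "3.5: major version 3 is excluded, guard is false");
static_assert(guard_is_true(61),  "6.1: guard is true");
static_assert(guard_is_true(70),  "7.0: guard is true");
```

Under that assumption, the guard is already true on both the 7.0 and the 6.1 devices mentioned in this thread, so `#if 1` should select the same code path as the unmodified source, while `#if 0` forces the other one. The sketch can be checked with `g++ -std=c++11 -c` on a file containing it; it compiles only if every assertion holds.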
I had the same issue on a CUDA device with compute capability 6.1. I am actually working on the wsl branch, but I did merge that fix from the 'next' branch. I also tried the changes in textures.cuh; neither worked.
Can you please provide the output of make show? It should be available as info/show.txt ready for export if you're on a recent enough branch.
The Microsoft compiler suffers from this bug, which affects GPUSPH. A large part of the changes introduced in the wsl branch are specifically to work around this, but it seems you're hitting a case we missed. We'll look into it.
I see, thanks for looking into this.