AMSKernelHex8Mesh.NGP_ams_forcing GPU unit test failure
On all nightly GPU test lines that don't include OpenFast, AMSKernelHex8Mesh.NGP_ams_forcing is failing with "terminate called without an active exception". (The OpenFast lines fail differently, due to #979)
@ddement https://sierra-cdash.sandia.gov/test/198957204 is one of the failing lines
I was unable to reproduce this on a local build today. Let's see what happens after we update openfast. I don't think we have any cuda builds without openfast that are currently working, because of the trilinos@develop build issue.
@PaulMullowney is this the failing test on crusher?
Yes
We are seeing it on the SNL dashboard, so it is definitely showing up in cuda too. Link for the stk team. @ldh4 @djglaze @ddement.
This seems like something we can debug locally, which should be a lot easier than debugging on crusher. It seems possible that it is the same bug.
That's good. If we can consistently reproduce that error with a cuda build, it's going to be much easier to diagnose the problem. I am currently running cuda-memcheck on it. Hopefully that will pinpoint the problematic line(s).
fingers crossed
Actually, the failing kernel on crusher is: AMSKernelHex8Mesh.NGP_ams_diff
The results from cuda-memcheck and compute-sanitizer on the AMSKernelHex8Mesh tests came back, and neither tool found any problems at all. I am skeptical that these runs were really all clean.
I ran a few commands like this: compute-sanitizer --log-file compute_sanitize_output --tool memcheck --leak-check full ./unittestX --gtest_filter=*AMS*
Can you run the racecheck tool also?
Racecheck is also reporting 0 hazards found in the AMS tests. I should note that I am somehow not seeing AMSKernelHex8Mesh.NGP_ams_forcing fail when running it on Summit. My build is using commit 60c1d76af3990d1f0356252fa5f080aca05efd3e.
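For reference, the racecheck pass would have been invoked along these lines (a sketch only; the binary name, log file, and gtest filter are assumed to mirror the memcheck command above):

```shell
# Same test binary and filter as the memcheck run above (assumed names),
# but using the compute-sanitizer race detector instead of memcheck.
compute-sanitizer --log-file compute_sanitize_racecheck --tool racecheck \
    ./unittestX --gtest_filter=*AMS*
```

One caveat worth keeping in mind: racecheck only looks for shared-memory hazards, so a clean report does not rule out races through global memory.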
bummer
[ RUN ] AMSKernelHex8Mesh.NGP_ams_forcing
cudaEventRecord(CudaInternal::constantMemReusable, cudaStream_t(cuda_instance->m_stream)) error( cudaErrorIllegalAddress): an illegal memory access was encountered /projects/wind/wind-testing-gpu/spack/opt/spack/linux-rhel7-x86_64/gcc-9.3.0/trilinos-develop-ijaxp4sxaaxs6ubec35fjbnvkmbjfusb/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp:561
Backtrace:
[0x656ced3]
[0x6564098]
[0x65640cb]
[0x6576cbd]
[0xd3e875]
[0xa53b40]
[0x65a86ad]
[0x6596950]
[0x6596d92]
[0x6596f72]
[0x659ef1d]
[0x659f46b]
[0x5eaec3]
__libc_start_main [0x7f2bb4c68555]
[0x795c28]
There is definitely something amiss on device in the AMS code.
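Since the illegal address is reported asynchronously (it only surfaces at the next cudaEventRecord inside Kokkos_Cuda_KernelLaunch.hpp, not at the launch that caused it), one way to narrow down the offending kernel is to rerun the test with synchronous launches, e.g. (binary name and filter are assumptions mirroring the earlier commands):

```shell
# CUDA_LAUNCH_BLOCKING=1 forces each kernel launch to complete before the
# host continues, so the illegal memory access is attributed to the kernel
# that actually triggered it. Test binary name is an assumption.
CUDA_LAUNCH_BLOCKING=1 ./unittestX --gtest_filter=AMSKernelHex8Mesh.NGP_ams_forcing
```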
Hi all, is this something @jamelvin and I should be helping you with? We don't typically run on GPUs.
@marchdf I don't think so, unless you are interested. This is likely not related to the physics application. If we find a design flaw in the algorithm, we'll let you know, though.
Sounds good. The people on this thread are probably better placed to fix this ;)
Reopening for now. Did not intend to close with #1052
Resolved now