
AMSKernelHex8Mesh.NGP_ams_forcing GPU unit test failure

Open tasmith4 opened this issue 3 years ago • 18 comments

On all nightly GPU test lines that don't include OpenFast, AMSKernelHex8Mesh.NGP_ams_forcing is failing with "terminate called without an active exception". (The OpenFast lines fail differently, due to #979)

tasmith4 avatar Jul 12 '22 13:07 tasmith4

@ddement https://sierra-cdash.sandia.gov/test/198957204 is one of the failing lines

tasmith4 avatar Jul 12 '22 16:07 tasmith4

I was unable to reproduce this on a local build today. Let's see what happens after we update openfast. I don't think we have any cuda builds without openfast that are currently working due to the trilinos@develop build issue.

psakievich avatar Sep 09 '22 20:09 psakievich

@PaulMullowney is this the failing test on crusher?

psakievich avatar Oct 07 '22 02:10 psakievich

Yes

PaulMullowney avatar Oct 07 '22 15:10 PaulMullowney

We are seeing it on the SNL dashboard so it is definitely showing up in cuda too. Link for stk team. @ldh4 @djglaze @ddement .

psakievich avatar Oct 07 '22 22:10 psakievich

This seems like something we can debug locally, which should be a lot easier than debugging on crusher. It seems possible that it is the same bug.

alanw0 avatar Oct 07 '22 22:10 alanw0

> We are seeing it on the SNL dashboard so it is definitely showing up in cuda too. Link for stk team. @ldh4 @djglaze @ddement .

That's good. If we can consistently reproduce that error with a cuda build, it will be much easier to diagnose the problem. I am currently running cuda-memcheck on it; hopefully that will pinpoint the problematic line(s).

ldh4 avatar Oct 07 '22 22:10 ldh4

fingers crossed

PaulMullowney avatar Oct 07 '22 23:10 PaulMullowney

Actually, the failing kernel on crusher is: AMSKernelHex8Mesh.NGP_ams_diff

PaulMullowney avatar Oct 07 '22 23:10 PaulMullowney

The results from cuda-memcheck and compute-sanitizer on the AMSKernelHex8Mesh tests came back: neither tool found any problems at all. I am skeptical that these runs were actually all clean.

I ran a few commands like this: compute-sanitizer --log-file compute_sanitize_output --tool memcheck --leak-check full ./unittestX --gtest_filter=*AMS*

ldh4 avatar Oct 10 '22 15:10 ldh4

Can you run the racecheck tool also?

PaulMullowney avatar Oct 10 '22 15:10 PaulMullowney

Racecheck is also reporting 0 hazards found in the AMS tests. I should note that I am somehow not seeing AMSKernelHex8Mesh.NGP_ams_forcing fail when running it on Summit. My build is using 60c1d76af3990d1f0356252fa5f080aca05efd3e.

ldh4 avatar Oct 10 '22 16:10 ldh4
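The sanitizer runs discussed above, collected as a sketch. The binary name `./unittestX` and the gtest filter follow the thread; the log-file names and the availability check are my own additions, and the commands require a CUDA toolkit and GPU to do real work.

```shell
# Run both memcheck and racecheck over the AMS unit tests, as in the
# comments above. Falls back to a message if compute-sanitizer is absent.
BIN=./unittestX
FILTER='--gtest_filter=*AMS*'
if command -v compute-sanitizer >/dev/null 2>&1; then
    compute-sanitizer --log-file memcheck.log  --tool memcheck --leak-check full "$BIN" $FILTER
    compute-sanitizer --log-file racecheck.log --tool racecheck "$BIN" $FILTER
else
    echo "compute-sanitizer not found; a CUDA toolkit install is required"
fi
```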

bummer

PaulMullowney avatar Oct 10 '22 17:10 PaulMullowney

New AMS failure today:

[ RUN      ] AMSKernelHex8Mesh.NGP_ams_forcing
cudaEventRecord(CudaInternal::constantMemReusable, cudaStream_t(cuda_instance->m_stream)) error( cudaErrorIllegalAddress): an illegal memory access was encountered /projects/wind/wind-testing-gpu/spack/opt/spack/linux-rhel7-x86_64/gcc-9.3.0/trilinos-develop-ijaxp4sxaaxs6ubec35fjbnvkmbjfusb/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp:561
Backtrace:
                       [0x656ced3]
                       [0x6564098]
                       [0x65640cb]
                       [0x6576cbd]
                        [0xd3e875]
                        [0xa53b40]
                       [0x65a86ad]
                       [0x6596950]
                       [0x6596d92]
                       [0x6596f72]
                       [0x659ef1d]
                       [0x659f46b]
                        [0x5eaec3]
__libc_start_main [0x7f2bb4c68555]
                        [0x795c28]

There is definitely something amiss on device in the AMS code.

psakievich avatar Oct 11 '22 15:10 psakievich

Hi all, is this something @jamelvin and I should be helping you with? We don't typically run on GPUs.

marchdf avatar Oct 11 '22 16:10 marchdf

@marchdf I don't think so unless you are interested. This is likely not related to the physics application. If we find a design flaw in the algorithm we'll let you know though.

psakievich avatar Oct 11 '22 16:10 psakievich

Sounds good. The people on this thread are probably better placed to fix ;)

marchdf avatar Oct 11 '22 17:10 marchdf

Reopening for now; I did not intend to close this with #1052.

psakievich avatar Oct 20 '22 16:10 psakievich

Resolved now

tasmith4 avatar Oct 28 '22 15:10 tasmith4