Stokhos: Test Stokhos_KokkosViewUQPCEUnitTest_Serial_MPI_1 randomly failing in 'ats2' CUDA PR build on 'vortex'

Open bartlettroscoe opened this issue 2 years ago • 1 comments

SUMMARY:

CC: @trilinos/stokhos

Next Action Status

Description

As shown in this query (click "Shown Matching Output" in upper right) the test:

  • Stokhos_KokkosViewUQPCEUnitTest_Serial_MPI_1

in the builds:

  • PR-10472-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-911
  • PR-10472-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-915
  • PR-10571-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1113
  • PR-11086-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1182
  • PR-11099-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1211

started failing on testing day 2022-05-01.

When the test fails, it produces error output like the following:

3. Kokkos_View_PCE_DS_LayoutLeft_DeepCopy_NonContiguous_UnitTest ... 
 val = 2.21341409336878452e-321 == val_expected = 1.01000000000000000e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
 val = 3.56221330651538758e-321 == val_expected = 1.01099999999999994e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
 val = 6.95327277181438017e-310 == val_expected = 1.01200000000000003e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
...
 val = 4.79243676466009148e-322 == val_expected = 1.02718181818181819e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
 val = 1.01909090909090907e+02 == val_expected = 1.01909090909090907e+02 : passed
 val = 1.02009090909090901e+02 == val_expected = 1.02009090909090901e+02 : passed
 val = 1.02109090909090909e+02 == val_expected = 1.02109090909090909e+02 : passed
 val = 1.02209090909090904e+02 == val_expected = 1.02209090909090904e+02 : passed
 val = 1.02309090909090912e+02 == val_expected = 1.02309090909090912e+02 : passed
 val = 1.02409090909090907e+02 == val_expected = 1.02409090909090907e+02 : passed
 val = 1.02509090909090901e+02 == val_expected = 1.02509090909090901e+02 : passed
 val = 1.02609090909090909e+02 == val_expected = 1.02609090909090909e+02 : passed
 val = 1.02709090909090904e+02 == val_expected = 1.02709090909090904e+02 : passed
 val = 1.02809090909090912e+02 == val_expected = 1.02809090909090912e+02 : passed
 [FAILED]  (0.00153 sec) Kokkos_View_PCE_DS_LayoutLeft_DeepCopy_NonContiguous_UnitTest
 Location: /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:266
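
A note on reading the failing values above: all of them are subnormal doubles (magnitudes far below 1e-308), which is not what an off-by-a-rounding-error floating-point result looks like. The sketch below is illustrative only, assuming IEEE 754 binary64 doubles; the helper name `double_bits` is hypothetical and not part of Stokhos or Kokkos. It decodes each failing `val` from the log into its raw 64-bit pattern.

```python
import struct


def double_bits(x: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


# Failing `val` entries copied verbatim from the log output above.
failing = [
    2.21341409336878452e-321,
    3.56221330651538758e-321,
    6.95327277181438017e-310,
    4.79243676466009148e-322,
]

for v in failing:
    # Print each value next to its raw bit pattern in hexadecimal.
    print(f"{v!r:>26} -> 0x{double_bits(v):016x}")
```

Three of the four values decode to small integers (448, 721, and 97), and the third has 0x7fff in its upper bits. That pattern would be consistent with the destination elements holding stale or non-floating-point data rather than the copied values, but this is only an observation about the log; the thread does not confirm a root cause.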

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

Follow instructions at:

  • https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors

or see:

  • https://gitlab-ex.sandia.gov/rabartl/run_trilinos_pr_builds/-/blob/master/README.ats2.md

for specific instructions on how to build and run on 'vortex'.

bartlettroscoe avatar Oct 06 '22 03:10 bartlettroscoe

FYI: This failure took out my last PR build iteration https://github.com/trilinos/Trilinos/pull/11099#issuecomment-1269142489 (see https://github.com/trilinos/Trilinos/pull/11099#issuecomment-1269249698).

bartlettroscoe avatar Oct 06 '22 03:10 bartlettroscoe

So far I have not been able to reproduce this, either on the ATS2 platform or on a regular Linux platform (note the failing test is running with the Serial execution space, so whatever is going on isn't related to CUDA). I've also tried running the test under valgrind and with the clang address sanitizer. Both came up empty.

etphipp avatar Oct 25 '22 19:10 etphipp

@etphipp, it was reported at the TUG today that Sacado might have some undefined memory issues. Does this use DFAD or the reverse AD types?

bartlettroscoe avatar Oct 26 '22 00:10 bartlettroscoe

> @etphipp, it was reported at the TUG today that Sacado might have some undefined memory issues. Does this use DFAD or the reverse AD types?

Yes. It is issue #7741. I never saw it because the team mention was invalid (which is probably a frighteningly common mistake due to the extra characters in the suggested team mention in the Trilinos issue template). I'm working on it now and believe I might have it fixed. It is due to the horribly designed memory management in RAD.

etphipp avatar Oct 26 '22 00:10 etphipp

> I never saw it because the team mention was invalid (which is probably a frighteningly common mistake due to the extra characters in the suggested team mention in the Trilinos issue template).

I may be mistaken, but I believe that users who are not in the Trilinos GitHub group cannot tag individual Trilinos teams. This is why, on a lot of recent issues, you will see @cgcgcg working hard to tag the correct Trilinos teams as soon as the issues are opened.

GrahamBenHarper avatar Oct 26 '22 14:10 GrahamBenHarper

> I may be mistaken, but I believe that users who are not in the Trilinos Github group cannot tag individual Trilinos teams.

That is correct. That is a long-known flaw in the Trilinos issue-tracking process.

bartlettroscoe avatar Oct 26 '22 18:10 bartlettroscoe

Looking at the above query, the last failure was 10/5 and I was never able to reproduce it. So I am going to close this for now. If it fails again, please reopen it.

etphipp avatar Nov 28 '22 21:11 etphipp