Trilinos
Stokhos: Test Stokhos_KokkosViewUQPCEUnitTest_Serial_MPI_1 randomly failing in 'ats2' CUDA PR build on 'vortex'
SUMMARY:
CC: @trilinos/stokhos
Next Action Status
Description
As shown in this query (click "Show Matching Output" in the upper right), the test:
- Stokhos_KokkosViewUQPCEUnitTest_Serial_MPI_1
in the builds:
- PR-10472-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-911
- PR-10472-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-915
- PR-10571-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1113
- PR-11086-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1182
- PR-11099-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1211
started failing on testing day 2022-05-01.
When the test fails, it produces error output like the following:
3. Kokkos_View_PCE_DS_LayoutLeft_DeepCopy_NonContiguous_UnitTest ...
val = 2.21341409336878452e-321 == val_expected = 1.01000000000000000e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
val = 3.56221330651538758e-321 == val_expected = 1.01099999999999994e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
val = 6.95327277181438017e-310 == val_expected = 1.01200000000000003e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
...
val = 4.79243676466009148e-322 == val_expected = 1.02718181818181819e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
val = 1.01909090909090907e+02 == val_expected = 1.01909090909090907e+02 : passed
val = 1.02009090909090901e+02 == val_expected = 1.02009090909090901e+02 : passed
val = 1.02109090909090909e+02 == val_expected = 1.02109090909090909e+02 : passed
val = 1.02209090909090904e+02 == val_expected = 1.02209090909090904e+02 : passed
val = 1.02309090909090912e+02 == val_expected = 1.02309090909090912e+02 : passed
val = 1.02409090909090907e+02 == val_expected = 1.02409090909090907e+02 : passed
val = 1.02509090909090901e+02 == val_expected = 1.02509090909090901e+02 : passed
val = 1.02609090909090909e+02 == val_expected = 1.02609090909090909e+02 : passed
val = 1.02709090909090904e+02 == val_expected = 1.02709090909090904e+02 : passed
val = 1.02809090909090912e+02 == val_expected = 1.02809090909090912e+02 : passed
[FAILED] (0.00153 sec) Kokkos_View_PCE_DS_LayoutLeft_DeepCopy_NonContiguous_UnitTest
Location: /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:266
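A note on the failing values above: all four are far below the smallest normal double (about 2.2e-308), i.e. they are subnormal numbers. A minimal Python sketch for inspecting their raw 64-bit patterns is given here. The helper names `raw_bits` and `is_subnormal` are made up for illustration. Reading such values as memory that was never written with the expected doubles is only one plausible interpretation; nothing in this issue confirms the cause.

```python
import struct

# Failing "val" entries copied from the test output above.
failing_vals = [
    2.21341409336878452e-321,
    3.56221330651538758e-321,
    6.95327277181438017e-310,
    4.79243676466009148e-322,
]

def raw_bits(x: float) -> int:
    """Return the IEEE-754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def is_subnormal(x: float) -> bool:
    """True if x has a zero exponent field and a nonzero mantissa."""
    bits = raw_bits(x)
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0

for v in failing_vals:
    print(f"{v!r:>24}  bits=0x{raw_bits(v):016x}  subnormal={is_subnormal(v)}")
```

Because a subnormal double's bit pattern is just a small integer in the low 52 bits, printing the bits in hex makes it easier to judge whether a bad value looks like a stray integer, an address, or something else.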
Current Status on CDash
Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.
Steps to Reproduce
Follow instructions at:
- https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors
or see:
- https://gitlab-ex.sandia.gov/rabartl/run_trilinos_pr_builds/-/blob/master/README.ats2.md
for specific instructions on how to build and run on 'vortex'.
FYI: This failure took out my last PR build iteration https://github.com/trilinos/Trilinos/pull/11099#issuecomment-1269142489 (see https://github.com/trilinos/Trilinos/pull/11099#issuecomment-1269249698).
So far I have not been able to reproduce this, either on the ATS2 platform or on a regular Linux platform (note the failing test is running with the Serial execution space, so whatever is going on isn't related to CUDA). I've also tried running the test under valgrind and with the clang address sanitizer. Both came up empty.
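Since the failure is intermittent, one way to hunt for it is to rerun the test many times and keep the output of the first failing run. The sketch below is only an illustration of that idea, not the procedure used in this issue. The function name `run_until_failure` is hypothetical, and the command is a stand-in, because the actual path to the test executable is not given here; substitute whatever command your build uses to launch the test.

```python
import subprocess
import sys

def run_until_failure(cmd, max_runs=100):
    """Run cmd repeatedly.

    Returns (run_number, combined_output) for the first run that exits
    with a nonzero status, or None if every run passes.
    """
    for run in range(1, max_runs + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        if proc.returncode != 0:
            return run, proc.stdout
    return None

# Stand-in command (an assumption) so the sketch runs anywhere.
# Replace it with the real test command from your build tree.
stand_in = [sys.executable, "-c", "print('simulated failure'); raise SystemExit(1)"]

result = run_until_failure(stand_in, max_runs=5)
if result is None:
    print("no failure observed")
else:
    run_number, output = result
    print(f"first failure on run {run_number}:")
    print(output)
```

Capturing the output of the failing run matters here because a rerun may well pass, so the log from the failure itself is the only evidence available.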
@etphipp, it was reported at the TUG today that Sacado might have some undefined memory issues. Does this use DFAD or the reverse AD types?
> @etphipp, it was reported at the TUG today that Sacado might have some undefined memory issues. Does this use DFAD or the reverse AD types?
Yes. It is issue #7741. I never saw it because the team mention was invalid (which is probably a frighteningly common mistake due to the extra characters in the suggested team mention in the Trilinos issue template). I'm working on it now and believe I might have it fixed. It is due to the horribly designed memory management in RAD.
> I never saw it because the team mention was invalid (which is probably a frighteningly common mistake due to the extra characters in the suggested team mention in the Trilinos issue template).
I may be mistaken, but I believe that users who are not in the Trilinos Github group cannot tag individual Trilinos teams. This is why with a lot of recent issues you will see @cgcgcg working hard to tag the correct Trilinos teams as soon as they're opened.
> I may be mistaken, but I believe that users who are not in the Trilinos Github group cannot tag individual Trilinos teams.
That is correct. That is a long-known flaw in the Trilinos Issue tracking processes.
Looking at the above query, the last failure was 10/5 and I was never able to reproduce it. So I am going to close this for now. If it fails again, please reopen it.