Dice
Tests failing
Default tests
I ran the tests on 4 MPI tasks (as hardcoded in the test scripts), and they all pass except for DQMC/multislater_ghf_gi, for which I get:
...running DQMC/multislater_ghf_gi
DQMC: ./eigen/Eigen/src/Core/Block.h:146: Eigen::Block<XprType, BlockRows, BlockCols, InnerPanel>::Block(XprType&, Eigen::Index, Eigen::Index, Eigen::Index, Eigen::Index) [with XprType = Eigen::Map<Eigen::Matrix<double, -1, -1>, 0, Eigen::Stride<0, 0> >; int BlockRows = -1; int BlockCols = -1; bool InnerPanel = false; Eigen::Index = long int]: Assertion `startRow >= 0 && blockRows >= 0 && startRow <= xpr.rows() - blockRows && startCol >= 0 && blockCols >= 0 && startCol <= xpr.cols() - blockCols' failed.
[std-hb2-pg0-9:432066] *** Process received signal ***
[std-hb2-pg0-9:432066] Signal: Aborted (6)
[std-hb2-pg0-9:432066] Signal code: (-6)
[std-hb2-pg0-9:432066] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff8d06d6520]
[std-hb2-pg0-9:432066] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7ff8d072a9fc]
[std-hb2-pg0-9:432066] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7ff8d06d6476]
[std-hb2-pg0-9:432066] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7ff8d06bc7f3]
[std-hb2-pg0-9:432066] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7ff8d06bc71b]
[std-hb2-pg0-9:432066] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7ff8d06cde96]
[std-hb2-pg0-9:432066] [ 6] DQMC(+0x15170e)[0x55dc226a270e]
[std-hb2-pg0-9:432066] [ 7] DQMC(+0x1e5e17)[0x55dc22736e17]
[std-hb2-pg0-9:432066] [ 8] DQMC(+0x2e548)[0x55dc2257f548]
[std-hb2-pg0-9:432066] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7ff8d06bdd90]
[std-hb2-pg0-9:432066] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7ff8d06bde40]
[std-hb2-pg0-9:432066] [11] DQMC(+0x2eb85)[0x55dc2257fb85]
[std-hb2-pg0-9:432066] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node DESKTOP-HSCRDM6 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Different number of MPI tasks
When I run the same tests on a different number of MPI tasks (for example 2, 6, 8, or 16), most or all of them fail. The energy difference with respect to the reference value is often of the order of 0.1 or 0.01, and in other cases of the order of 0.001, so it is well above the set tolerances (1e-6 or 1e-7).
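For reference, the kind of comparison I am printing can be sketched as follows. This is a minimal illustration, not the actual `testEnergy.py` from the repository: the function name `check_energy` and its default tolerance are assumptions chosen to mirror the output format shown in this report.

```python
# Hypothetical helper, NOT the repository's testEnergy.py.
def check_energy(e_test: float, e_ref: float, tol: float = 1e-6) -> bool:
    """Print the signed energy difference and return True if |diff| < tol."""
    diff = e_test - e_ref
    passed = abs(diff) < tol
    print("test passed" if passed else "test failed")
    print(f"eTest - eRef = {diff:.4e}")
    if not passed:
        print(f"eRef = {e_ref}")
        print(f"eTest = {e_test}")
    return passed


# Values taken from the VMC hubbard_1x10 test run on 8 MPI tasks.
ok = check_energy(e_test=-4.15624069, e_ref=-4.17498229)
```

With these numbers the difference is about 1.9e-02, four orders of magnitude above a 1e-6 tolerance.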
Example test output
## for practical purposes, I modified the `testEnergy.py` script to print the energy differences.
[lercole@std-hb2-pg0-9 tests]$ sed -i 's/mpirun -np 4/mpirun -np 8/g' run*sh
[lercole@std-hb2-pg0-9 tests]$ ./runTests.sh
Running Tests for VMC/GFMC/NEVPT2/FCIQMC/DQMC
======================================================
Running Tests for VMC
======================================================
...running hubbard_1x10
test failed
eTest - eRef = 1.8742e-02
eRef = -4.17498229
eTest = -4.15624069
...running hubbard_1x10 ghf
test failed
eTest - eRef = -3.5035e-02
eRef = -5.42482593
eTest = -5.45986127
...running hubbard_1x10 agp
test failed
eTest - eRef = -2.5273e-02
eRef = -3.83213158
eTest = -3.85740436
...running hubbard_1x14
test failed
eTest - eRef = 1.7223e-02
eRef = -10.78302325
eTest = -10.76580073
...running hubbard_1x22
test failed
eTest - eRef = -1.2980e-02
eRef = -7.95585667
eTest = -7.96883698
...running hubbard_1x50
test failed
eTest - eRef = 6.0798e-02
eRef = -38.80608576
eTest = -38.74528821
...running hubbard_1x6
test passed
eTest - eRef = 0.0000e+00
...running hubbard_18_tilt uhf
test failed
eTest - eRef = -9.6450e-02
eRef = -16.29056087
eTest = -16.38701062
...running h4 ghf complex
test failed
eTest - eRef = -1.3367e-03
eRef = -2.14652278
eTest = -2.14785948
...running h4 pfaffian complex
test failed
eTest - eRef = -2.5898e-03
eRef = -2.14267989
eTest = -2.1452697
...running h10 pfaffian
test failed
eTest - eRef = -1.7162e-03
eRef = -5.19471654
eTest = -5.19643273
...running h20
test failed
eTest - eRef = -9.8264e-03
eRef = -7.06168529
eTest = -7.07151173
...running h20 ghf
test failed
eTest - eRef = 3.6540e-03
eRef = -10.28553672
eTest = -10.28188272
...running c2
test failed
eTest - eRef = 2.7509e-03
eRef = -74.55055844
eTest = -74.5478075
Running Tests for GFMC
======================================================
...running hubbard_18_tilt uhf
test failed
eTest - eRef = -9.6450e-02
eRef = -16.29056087
eTest = -16.38701062
...running hubbard_18_tilt gfmc
test failed
eTest - eRef = -3.1810e-02
eRef = -16.88259405
eTest = -16.9144044
Running Tests for NEVPT2
======================================================
...running NEVPT2/n2_vdz/stoch
test failed
eTest - eRef = 4.0540e-04
eRef = -109.1846287
eTest = -109.1842233
...running NEVPT2/n2_vdz/continue_norms PRINT
test failed
eTest - eRef = 7.6540e-04
eRef = -109.183194
eTest = -109.1824286
...running NEVPT2/n2_vdz/continue_norms READ
test failed
eTest - eRef = 7.6750e-04
eRef = -109.1825321
eTest = -109.1817646
...running NEVPT2/n2_vdz/exact_energies PRINT
test passed
eTest - eRef = 0.0000e+00
...running NEVPT2/n2_vdz/exact_energies READ
test passed
eTest - eRef = 0.0000e+00
...running NEVPT2/h4_631g/determ
test passed
eTest - eRef = 0.0000e+00
...running NEVPT2/polyacetylene/stoch
test failed
eTest - eRef = -1.6040e-04
eRef = -155.1823833
eTest = -155.1825437
...running NEVPT2/n2_vdz/single_perturber
determ test passed
stoch test passed
Running Tests for FCIQMC
======================================================
...running FCIQMC/He2
test failed
eTest - eRef = 1.1937e-03
eRef = -5.762943084279232
eTest = -5.761749337716283
...running FCIQMC/He2_hb_uniform
test failed
eTest - eRef = 1.2809e-04
eRef = -5.762337845140905
eTest = -5.762209752789398
...running FCIQMC/Ne_plateau
test failed
eTest - eRef = 2.9082e-04
eRef = -128.70958249279593
eTest = -128.70929167168606
...running FCIQMC/Ne_initiator
test failed
eTest - eRef = 1.2952e-02
eRef = -128.72525652060892
eTest = -128.71230417093656
...running FCIQMC/Ne_initiator_replica
test failed
eRef1 = -128.70570937888726
eTest1 = -128.71161315222963
eRef2 = -128.70849149597655
eTest2 = -128.70353835294057
eRefVar = -128.54030751935989
eTestVar = -128.37441541385843
eRefEN2 = 0.0
eTestEN2 = 0.0
...running FCIQMC/Ne_initiator_en2
test failed
eRef1 = -128.70745111511025
eTest1 = -128.704079503759
eRef2 = -128.70728765356745
eTest2 = -128.70076936736214
eRefVar = -128.88505380643883
eTestVar = -128.95491510237173
eRefEN2 = -0.020670887982356057
eTestEN2 = 0.00579960323366854
...running FCIQMC/Ne_initiator_en2_ss
test failed
eRef1 = -128.70928156824655
eTest1 = -128.70948917426213
eRef2 = -128.70965980981856
eTest2 = -128.71001418429785
eRefVar = -128.70575284782976
eTestVar = -128.70536975709993
eRefEN2 = 0.0011977728994427303
eTestEN2 = -0.004109782185213983
...running FCIQMC/water_vdz_hb
test failed
eTest - eRef = -1.0011e-03
eRef = -76.24055896137513
eTest = -76.24156003663879
Running Tests for AFQMC
======================================================
...running DQMC/rhf_rhf
test failed
eTest - eRef = 4.2476e-03
eRef = -76.121061333
eTest = -76.11681368910571
wTest - wRef = 7.9938e+01
wRef = 80.019399
wTest = 159.9578402339277
...running DQMC/rhf_uhf
test failed
eTest - eRef = -1.1286e-02
eRef = -5.3551361163
eTest = -5.366422420038034
wTest - wRef = 7.9821e+01
wRef = 79.869323
wTest = 159.6899502970639
...running DQMC/uhf_rhf
test failed
eTest - eRef = -5.5420e-03
eRef = -5.3694725281
eTest = -5.375014518745735
wTest - wRef = 7.9929e+01
wRef = 79.914801
wTest = 159.8441082913229
...running DQMC/uhf_uhf
test failed
eTest - eRef = 1.7012e-03
eRef = -75.687851842
eTest = -75.68615062549716
wTest - wRef = 7.9981e+01
wRef = 79.938639
wTest = 159.919893886195
...running DQMC/multislater_rhf
test failed
eTest - eRef = -1.5162e-03
eRef = -109.09511182
eTest = -109.0966280240538
wTest - wRef = 1.9942e+01
wRef = 19.984035
wTest = 39.92619147592583
...running DQMC/multislater_uhf
test failed
eTest - eRef = -3.4923e-03
eRef = -37.753007223
eTest = -37.75649954823818
wTest - wRef = 7.9963e+01
wRef = 79.787989
wTest = 159.750495666826
...running DQMC/ghf_ghf_soc
test failed
eTest - eRef = 4.7815e-02
eRef = -153.33796367
eTest = -153.2901486912503
wTest - wRef = 1.9919e+01
wRef = 19.870107
wTest = 39.78864245446707
...running DQMC/uhf_uhf_ui
test failed
eTest - eRef = 1.5279e-03
eRef = -3.0949887079
eTest = -3.093460783634393
wTest - wRef = 8.0021e+01
wRef = 80.039541
wTest = 160.0602107374257
...running DQMC/multislater_uhf_ui
test failed
eTest - eRef = 1.0254e-03
eRef = -75.683616692
eTest = -75.68259127061897
wTest - wRef = 7.9847e+01
wRef = 79.855562
wTest = 159.7022658597608
...running DQMC/ghf_ghf_gi
test failed
eTest - eRef = 2.5796e-03
eRef = -1.430762509
eTest = -1.428182921908405
wTest - wRef = 7.9826e+01
wRef = 79.864586
wTest = 159.6908960563758
...running DQMC/multislater_ghf_gi
DQMC: /anfhome/spack/opt/spack/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placehold/linux-almalinux8-zen2/gcc-13.2.0/eigen-3.4.0-vhwiejcim3wl4uwfktlwdoxazb3ejmyl/include/eigen3/Eigen/src/Core/Block.h:146: Eigen::Block<XprType, BlockRows, BlockCols, InnerPanel>::Block(XprType&, Eigen::Index, Eigen::Index, Eigen::Index, Eigen::Index) [with XprType = Eigen::Map<Eigen::Matrix<double, -1, -1>, 0, Eigen::Stride<0, 0> >; int BlockRows = -1; int BlockCols = -1; bool InnerPanel = false; Eigen::Index = long int]: Assertion `startRow >= 0 && blockRows >= 0 && startRow <= xpr.rows() - blockRows && startCol >= 0 && blockCols >= 0 && startCol <= xpr.cols() - blockCols' failed.
[...]
fh = open('samples.dat', 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'samples.dat'
I have tried building Dice with GCC 13.2, GCC 11.4, and ICC 2021.10, and I always get these inconsistencies. I am linking it with [email protected], [email protected], MKL 2023.2, and OpenMPI.
Have you seen this behavior before, and do you understand where these differences may come from? Or is this within the expected statistical fluctuations due to the stochastic nature of the method? Thanks
@xubo-wang should know about the ghf test, it has been failing for a bit I think.
About the number of tasks: this happens because the convention in our code is to increase the number of samples with the number of tasks. The sampling input options are per task, so the tests only work with four tasks.
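The DQMC weights reported in the test output above are consistent with this. As a rough check (a sketch only, assuming the reported weight grows in proportion to the total number of samples), going from 4 to 8 tasks should roughly double `wTest` relative to `wRef`. The dictionary and variable names below are made up for the illustration; the numbers are copied from the output.

```python
# (wRef on 4 tasks, wTest on 8 tasks), copied from the test output above.
weights = {
    "DQMC/rhf_rhf": (80.019399, 159.9578402339277),
    "DQMC/multislater_rhf": (19.984035, 39.92619147592583),
    "DQMC/ghf_ghf_soc": (19.870107, 39.78864245446707),
}
REF_TASKS, TEST_TASKS = 4, 8

ratios = {}
for name, (w_ref, w_test) in weights.items():
    ratios[name] = w_test / w_ref
    # If sampling is specified per task, the weight per task should be
    # about the same in both runs, and the ratio close to 8 / 4 = 2.
    print(
        f"{name}: per-task weight {w_ref / REF_TASKS:.2f} (4 tasks) "
        f"vs {w_test / TEST_TASKS:.2f} (8 tasks), ratio {ratios[name]:.3f}"
    )
```

All three ratios come out within about 0.3% of 2, so the 8-task runs are sampling a different (larger) ensemble than the one the reference energies were generated with.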