Ion-ion energy convergence in quasi-2D systems
Describe the bug The reference ion-ion energy of a quasi-2D system (bilayer graphene, 3x3x1 tiling, at gamma) converges to an incorrect value at a bilayer separation of 7 Angstroms.
Because the reference and QMCPACK ion-ion energies do not match, QMCPACK terminates after printing "Checking ion-ion Ewald energy against reference". I used the ewald lr_handler and increased the kc cutoff up to 100, but the mismatch persists. The ion-ion energies below were obtained with ccECP potentials and the LDA functional:
| kc_cutoff | QMCPACK (Ha) | Reference (Ha) | QE (Ha) |
|---|---|---|---|
| 50 | 1131.3290657604 | 1131.2315948714 | 1131.32906586 |
| 75 | 1131.3290657129 | 1131.2315948714 | 1131.32906586 |
| 100 | 1131.3290656824 | 1131.2315948714 | 1131.32906586 |
QE prints the ion-ion energy in Ry for the primitive cell, so the QE values above are multiplied by 9/2 (a factor of 9 for the 3x3x1 tiled supercell and a factor of 1/2 to convert Ry to Ha).
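The 9/2 factor can be checked with a short script. This is only a sketch: the function name is hypothetical, and since the raw per-primitive-cell QE number is not quoted in this report, the script backs it out from the tabulated supercell value instead of reading it from a QE output file.

```python
RY_TO_HA = 0.5  # 1 Ry = 0.5 Ha


def primitive_ry_to_supercell_ha(e_primitive_ry, n_tiles):
    """Scale a primitive-cell energy in Ry to a tiled-supercell energy in Ha."""
    return e_primitive_ry * n_tiles * RY_TO_HA


n_tiles = 3 * 3 * 1               # 3x3x1 tiling
qe_supercell_ha = 1131.32906586   # QE column of the table (already scaled)
qmcpack_ha = 1131.3290657604      # QMCPACK column, kc = 50

# Derived, not measured: the primitive-cell Ry value implied by the table entry.
implied_primitive_ry = qe_supercell_ha / (n_tiles * RY_TO_HA)
roundtrip_ha = primitive_ry_to_supercell_ha(implied_primitive_ry, n_tiles)

print(f"scale factor           : {n_tiles * RY_TO_HA}")
print(f"QE (supercell, Ha)     : {roundtrip_ha:.8f}")
print(f"QMCPACK - QE (Ha)      : {qmcpack_ha - qe_supercell_ha:.3e}")
```

With this scaling, the QE and QMCPACK values agree closely, which is why the reference value is the suspect one.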
In the legacy code, the error was printed at every run I tested:
ERROR in ion-ion Ewald energy exceeds 0.0003 Ha/atom tolerance.
Reference ion-ion energy: 1131.2315948714
QMCPACK ion-ion energy: 1131.3290657604
ion-ion diff : 0.097470889017131
diff/atom : 0.0027075246949203
tolerance : 0.0003
However, the batched code did not always print the error (see the dmc folder in the attached files).
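The numbers in the error message above can be reproduced by hand. The sketch below assumes the check is a plain absolute difference per atom compared against the tolerance, and assumes 36 atoms (4 atoms per bilayer graphene primitive cell times 9 tiles), which is consistent with the ratio of the logged "ion-ion diff" and "diff/atom" values. The function name is hypothetical and is not QMCPACK's API.

```python
def check_ion_ion_energy(e_ref, e_qmc, n_atoms, tol_per_atom=3e-4):
    """Return (diff, diff_per_atom, passed) for a per-atom tolerance check."""
    diff = e_qmc - e_ref
    diff_per_atom = abs(diff) / n_atoms
    return diff, diff_per_atom, diff_per_atom <= tol_per_atom


N_ATOMS = 36  # assumption: 4 atoms/primitive cell x 9 tiles

e_reference = 1131.2315948714  # "Reference ion-ion energy" from the log
e_qmcpack = 1131.3290657604    # "QMCPACK ion-ion energy" from the log
e_qe = 1131.32906586           # QE value scaled to the supercell, in Ha

# Reference vs QMCPACK: this is the comparison that triggers the error.
diff, per_atom, passed = check_ion_ion_energy(e_reference, e_qmcpack, N_ATOMS)
print(f"ion-ion diff : {diff:.12f}")
print(f"diff/atom    : {per_atom:.12f}")
print(f"within tol   : {passed}")

# QE vs QMCPACK: the same check applied to the two values that agree.
_, per_atom_qe, passed_qe = check_ion_ion_energy(e_qe, e_qmcpack, N_ATOMS)
print(f"QE vs QMCPACK diff/atom: {per_atom_qe:.3e}, within tol: {passed_qe}")
```

The per-atom difference is roughly nine times the 0.0003 Ha/atom tolerance, while QE and QMCPACK pass the same check comfortably.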
To Reproduce Steps to reproduce the behavior:
- Using QMCPACK 3.15.9 and QE 7.0
- using complex legacy cpu and batched cpu variants
- full program/test invocation command: srun -N 4 -c 32 --cpu-bind=cores -n 4 qmcpack_complex vmc.in.xml
- additional steps: None
Expected behavior The reference values in the table should match the QE and QMCPACK ion-ion values within the tolerance as the kc cutoff increases. The batched code should also print the error in every run where it occurs.
System:
- System: Andes
- Module list: 1) gcc/9.3.0 2) openblas/0.3.17-omp 3) netlib-lapack/3.9.1 4) fftw/3.3.10 5) cmake/3.18.4 6) cuda/10.2.89 7) openmpi/4.0.4 8) hdf5/1.10.7 9) boost/1.74.0
- other systems where this is reproducible: none
Additional context Input/output files: ewald_sum.zip
The missing error message has nothing to do with the drivers. The error is printed while the Coulomb input is being parsed. Because only rank 0 prints, another rank may reach the error first and terminate all ranks before rank 0 has printed anything. Changing the error to UniformCommunicateError addresses this. Please test the fix.
The printing part is secondary to the main issue here. The main problem is that the reference energy is incorrect (likely due to premature termination of the sum). This has not been fixed, right?