abacus-develop

Precision Discrepancy of stress between single- and multi-core

Open Cstandardlib opened this issue 9 months ago • 2 comments

Describe the bug

While running a cell-relax calculation on FCC Al (see #6141), I found that the stress differs slightly between single-core and multi-core runs, on both CPU and GPU. Every contribution to the stress except the EWALD term deviates between the parallel configurations.

In this case, the first-step total stress values (in kbar) are as follows, all with OMP_NUM_THREADS=1:

  1. CPU, mpirun -np 1: -22.440750
  2. GPU, mpirun -np 1: -22.441476
  3. CPU, mpirun -np 4: -22.244413
  4. GPU, mpirun -np 4: -22.264452
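To put the four numbers above side by side, here is a small Python sketch that computes the absolute and relative gaps between configurations. The stress values are copied from the list; the dictionary layout and the helper name `spread` are illustrative only and are not part of ABACUS.

```python
# First-step total stress (kbar), copied from the list above.
stress_kbar = {
    ("CPU", 1): -22.440750,
    ("GPU", 1): -22.441476,
    ("CPU", 4): -22.244413,
    ("GPU", 4): -22.264452,
}

def spread(a, b):
    """Return (absolute gap in kbar, gap relative to run b in percent)."""
    diff = abs(stress_kbar[a] - stress_kbar[b])
    return diff, 100.0 * diff / abs(stress_kbar[b])

comparisons = [
    ("CPU np=4 vs CPU np=1", ("CPU", 4), ("CPU", 1)),
    ("GPU np=4 vs GPU np=1", ("GPU", 4), ("GPU", 1)),
    ("GPU np=1 vs CPU np=1", ("GPU", 1), ("CPU", 1)),
    ("GPU np=4 vs CPU np=4", ("GPU", 4), ("CPU", 4)),
]

for label, a, b in comparisons:
    diff, rel = spread(a, b)
    print(f"{label}: {diff:.6f} kbar ({rel:.3f} %)")
```

With these inputs, the gap between 1 and 4 MPI processes (about 0.2 kbar) is far larger than the CPU-vs-GPU gap at np=1 (below 0.001 kbar).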

Expected behavior

Should the results be nearly the same between single- and multi-core?

To Reproduce

A simple case that can be downloaded from https://github.com/mcresearch/abacus-user-guide/tree/master/examples/surface_energy/Al_fcc100/0_bulk.

Environment

  • OS: Ubuntu 22.04.4 LTS
  • Compiler:
    • gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
    • nvcc Build cuda_12.4.r12.4/compiler.33961263_0
  • ABACUS v3.9.0.2 Commit: 35448cbe7 (Mon Mar 31 09:24:22 2025 +0800)
  • Built with
cmake -B build -DUSE_CUDA=ON
cmake --build build -j`nproc`

Additional Context

No response

Task list for Issue attackers (only for developers)

  • [ ] Verify the issue is not a duplicate.
  • [ ] Describe the bug.
  • [ ] Steps to reproduce.
  • [ ] Expected behavior.
  • [ ] Error message.
  • [ ] Environment details.
  • [ ] Additional context.
  • [ ] Assign a priority level (low, medium, high, urgent).
  • [ ] Assign the issue to a team member.
  • [ ] Label the issue with relevant tags.
  • [ ] Identify possible related issues.
  • [ ] Create a unit test or automated test to reproduce the bug (if applicable).
  • [ ] Fix the bug.
  • [ ] Test the fix.
  • [ ] Update documentation (if necessary).
  • [ ] Close the issue and inform the reporter (if applicable).

Cstandardlib • Apr 11 '25 09:04

@pxlxingliang hello, can you retest this case?

dyzheng • Apr 22 '25 07:04

> @pxlxingliang hello, can you retest this case?

I have retested this case on CPU with the Bohrium image "registry.dp.tech/dptech/abacus-stable:LTSv3.10". The results with 1 core and with multiple cores are not exactly the same:

MPI processes   Energy (eV)        Stress 11 (kbar)   d_energy of last SCF step   drho of last SCF step
1               -1883.2222505012   -22.1915720470     -4.85765150e-08             2.3881e-10
2               -1883.2222505016   -22.3195005335     -1.47594585e-08             1.8349e-11
4               -1883.2222505009   -22.3739405740     -6.19875604e-10             5.2155e-11

I then set pw_seed to 0 to fix the random seed of the initial-guess density. With this setting, the results for different numbers of MPI processes are almost the same, although the difference slowly grows as the calculation proceeds. The remaining discrepancy is most likely floating-point rounding in the MPI summations, since the order of the additions changes with the number of processes.

MPI processes   Energy (eV)        Stress 11 (kbar)   d_energy of last SCF step   drho of last SCF step
1               -1883.2222505025   -22.2373067917     -2.53522548e-08             1.6637e-10
2               -1883.2222505017   -22.2373301205     -2.42250325e-08             1.6622e-10
4               -1883.2222505012   -22.2373176241     -2.44562774e-08             1.6631e-10
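The rounding explanation above can be illustrated with a toy Python sketch. It is not ABACUS code and does not call MPI: the hypothetical `chunked_sum` only mimics a block-distributed reduction, where each "rank" sums its own contiguous block and the partial sums are then added. The input values are deliberately extreme so that the effect is visible in the first digit; in a real run the differences would sit in the last digits and could then be amplified over the SCF iterations.

```python
def naive_sum(xs):
    """Plain left-to-right floating-point sum (no compensation)."""
    total = 0.0
    for x in xs:
        total += x
    return total

def chunked_sum(values, n_ranks):
    """Mimic a block-distributed reduction over n_ranks hypothetical ranks."""
    size = (len(values) + n_ranks - 1) // n_ranks  # block length per rank
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)

# Exaggerated toy data: the exact sum is 2.0, but 1.0 is lost whenever it is
# added to a number of magnitude 1e17 (spacing between doubles there is 16).
values = [1e17, 1.0, -1e17, 1.0]

for n_ranks in (1, 2, 4):
    print(n_ranks, chunked_sum(values, n_ranks))
```

The explicit loop in `naive_sum` is intentional: recent Python versions use compensated summation inside the built-in `sum` for floats, which would hide the effect being shown.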

Energies of the SCF process with 1 MPI process

inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4128704941      -1883.2037152559     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4136269962      -1883.2140079951     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142420330      -1883.2223759999     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142100680      -1883.2219410938     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142263472      -1883.2221625834     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142319274      -1883.2222385068     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142331332      -1883.2222549120     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142331330      -1883.2222549094     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142327125      -1883.2222491876     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328007      -1883.2222503881     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328157      -1883.2222505915     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328156      -1883.2222505901     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328072      -1883.2222504771     
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328091      -1883.2222505025 

Energies of the SCF process with 2 MPI processes

inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4128704941      -1883.2037152559     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4136269962      -1883.2140079951     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142420330      -1883.2223759999     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142100680      -1883.2219410937     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142263472      -1883.2221625834     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142319274      -1883.2222385068     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142331332      -1883.2222549120     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142331330      -1883.2222549095     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142327124      -1883.2222491862     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328007      -1883.2222503886     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328156      -1883.2222505904     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328155      -1883.2222505890     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328073      -1883.2222504775     
inputs-pwseed0-mpi2/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328091      -1883.2222505017     

Energies of the SCF process with 4 MPI processes

inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4128704941      -1883.2037152559     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4136269962      -1883.2140079951     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142420330      -1883.2223759999     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142100680      -1883.2219410937     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142263472      -1883.2221625834     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142319274      -1883.2222385068     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142331332      -1883.2222549120     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142331330      -1883.2222549097     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142327124      -1883.2222491867     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328008      -1883.2222503899     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328156      -1883.2222505905     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328155      -1883.2222505895     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328072      -1883.2222504767     
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328090      -1883.2222505012     
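The three listings can be compared step by step with a short script. The sketch below is a minimal Python example that parses lines in the format shown above and takes the last column, which matches the eV energies in the table. To stay short it embeds only SCF steps 1, 4, 9 and 14 from the 1-process and 4-process listings; the helper name `kohn_sham_ev` is illustrative.

```python
def kohn_sham_ev(lines):
    """Return the last numeric column of every E_KohnSham line."""
    energies = []
    for line in lines:
        fields = line.split()
        if "E_KohnSham" in fields:
            energies.append(float(fields[-1]))
    return energies

steps = [1, 4, 9, 14]  # SCF steps sampled from the listings above

mpi1 = """\
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4128704941      -1883.2037152559
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142100680      -1883.2219410938
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142327125      -1883.2222491876
inputs-pwseed0/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328091      -1883.2222505025
""".splitlines()

mpi4 = """\
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4128704941      -1883.2037152559
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142100680      -1883.2219410937
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142327124      -1883.2222491867
inputs-pwseed0-mpi4/OUT.ABACUS/running_scf.log: E_KohnSham     -138.4142328090      -1883.2222505012
""".splitlines()

# Absolute energy gap between the two runs at each sampled step.
diffs = [abs(a - b) for a, b in zip(kohn_sham_ev(mpi1), kohn_sham_ev(mpi4))]

for step, diff in zip(steps, diffs):
    print(f"SCF step {step:2d}: |E(mpi 1) - E(mpi 4)| = {diff:.1e} eV")
```

For these sampled steps the two runs agree exactly at step 1 and differ by roughly 1e-10 to 1e-9 eV afterwards. To run the same comparison on full logs, pass the lines of each file (or the grep output) to `kohn_sham_ev` instead of the embedded strings.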

pxlxingliang • Apr 25 '25 02:04