cp2k icon indicating copy to clipboard operation
cp2k copied to clipboard

CP2K fails with large cells and high number of OMP threads

Open abussy opened this issue 3 years ago • 6 comments

A question in the google group (https://groups.google.com/g/cp2k/c/dfuWp6UaIOY) sparked an investigation into the behaviour of the wavelet Poisson solver. Running a calculation with a 50A x 50A x 50A cell and 36 OMP threads lead to non-physical energies and warnings by the wavelet solver. It turns out that all Poisson solvers yield wrong result in such conditions.

I tested the input below with the ssmp and psmp (1 MPI rank) executables, with 16 and 32 OMP threads, for all Poisson solvers (making sure to change the periodicity when needed). Runs with 16 threads were all successful, and runs with 32 threads all failed. Reducing the box size to 40A x 40A x 40A and using 32 threads works as well.

In the end it seems that the problem is independent the Poisson solver, contrary to the initial assumptions. I currently have no idea of what goes wrong.

&FORCE_EVAL                                                                                          
  METHOD Quickstep                                                                                   
  &DFT                                                                                               
    BASIS_SET_FILE_NAME BASIS_MOLOPT                                                                 
    POTENTIAL_FILE_NAME POTENTIAL                                                                    
                                                                                                     
    &POISSON                                                                                         
      PERIODIC NONE !XYZ                                                                             
      PSOLVER WAVELET !ANALYTIC IMPLICIT MT MULTIPLOE PERIODIC                                       
    &END                                                                                             
                                                                                                     
    &XC                                                                                              
      &XC_FUNCTIONAL PBE                                                                             
      &END XC_FUNCTIONAL                                                                             
    &END XC                                                                                          
    &END DFT                                                                                         
  &SUBSYS                                                                                            
    &CELL                                                                                            
      ABC 50.0 50.0 50.0                                                                             
      PERIODIC NONE !XYZ                                                                             
    &END CELL                                                                                        
    &COORD                                                                                           
      O          0.00000        0.00000        0.11779                                               
      H          0.00000        0.75545       -0.47116                                               
      H          0.00000       -0.75545       -0.47116                                               
    &END                                                                                             
    &TOPOLOGY                                                                                        
      &CENTER_COORDINATES                                                                            
      &END CENTER_COORDINATES                                                                        
    &END TOPOLOGY                                                                                    
    &KIND H                                                                                          
      BASIS_SET SZV-MOLOPT-SR-GTH                                                                    
      POTENTIAL GTH-PBE                                                                              
    &END KIND                                                                                        
    &KIND O                                                                                          
      BASIS_SET SZV-MOLOPT-SR-GTH                                                                    
      POTENTIAL GTH-PBE                                                                              
    &END KIND                                                                                        
  &END SUBSYS                                                                                        
&END FORCE_EVAL                                                                                      
&GLOBAL                                                                                              
  PROJECT test                                                                                       
  PRINT_LEVEL MEDIUM                                                                                 
  RUN_TYPE ENERGY                                                                                    
&END GLOBAL

abussy avatar Jan 14 '22 15:01 abussy

I found which file produces the error. If I comment out all the #pragma omp directives from src/grid/ref/grid_ref_task_list.c, the above example runs fine with 32 threads. @oschuett I think you wrote this, maybe you can have a look into it ?

Note that the issue seems related to the number of grid points, as increasing the plane wave cutoff may also lead to the same behaviour.

abussy avatar Jan 18 '22 10:01 abussy

By any chance, could you try to increase the stack size?

ulimit -s 256000
export OMP_STACKSIZE=256M

alazzaro avatar Jan 18 '22 10:01 alazzaro

A stack overflow would have also been my first guess. It could also be an integer overflow.

See also #1775, which seems related.

oschuett avatar Jan 18 '22 10:01 oschuett

Perhaps this issue shows the desire to lower the stacksize (despite of the ability to increase it). The grid code uses C99 which allows to have stack-based arrays with sizes only known at runtime. However, it can be wise to allocate from the heap or to introduce some scratch memory buffer.

hfp avatar Jan 18 '22 11:01 hfp

I agree, that the grid code's stack usage might be a bit excessive. Unfortunately. just calling malloc is too slow. For DBM I therefore added a memory pool.

We should probably try to have a shared pool for all of CP2K, assuming we can come up with a good algorithm for managing it.

oschuett avatar Jan 18 '22 13:01 oschuett

By any chance, could you try to increase the stack size?

ulimit -s 256000 export OMP_STACKSIZE=256M

I tried it with no success, the SCF cycle still goes crazy.

abussy avatar Jan 18 '22 13:01 abussy