Random crash with ubuntu package

Open loikki opened this issue 2 years ago • 2 comments

Hi,

I am trying to use CP2K installed via apt-get (Ubuntu 22.04 and 23.04, package versions 9.1-2 and 2023.1-2 respectively) inside a container (Docker for building, Singularity for running). Unfortunately, I get random crashes, roughly once every 200-300 runs. I have no experience with CP2K and use it only for benchmarking, so I might be doing something stupid.

Here is the error:

Mar 17 09:34:55 NAME bash[38838]: Program received signal SIGABRT: Process abort signal.
Mar 17 09:34:55 NAME bash[38838]: Backtrace for this error:
Mar 17 09:34:55 NAME bash[38838]: #0  0x7ff350622ad0 in ???
Mar 17 09:34:55 NAME bash[38838]: #1  0x7ff350621c35 in ???
Mar 17 09:34:55 NAME bash[38838]: #2  0x7ff3501cb51f in ???
Mar 17 09:34:55 NAME bash[38838]: #3  0x7ff35021fa7c in pthread_kill
Mar 17 09:34:55 NAME bash[38838]: #4  0x7ff3501cb475 in raise
Mar 17 09:34:55 NAME bash[38838]: #5  0x7ff3501b17f2 in abort
Mar 17 09:34:55 NAME bash[38838]: #6  0x55a5cb392ef0 in ???
Mar 17 09:34:55 NAME bash[38838]: #7  0x55a5c93ecace in ???
Mar 17 09:34:55 NAME bash[38838]: #8  0x55a5c953dc01 in ???
Mar 17 09:34:55 NAME bash[38838]: #9  0x55a5c9427233 in ???
Mar 17 09:34:55 NAME bash[38838]: #10  0x55a5c9427cc3 in ???
Mar 17 09:34:55 NAME bash[38838]: #11  0x55a5c9326860 in ???
Mar 17 09:34:55 NAME bash[38838]: #12  0x55a5c9329744 in ???
Mar 17 09:34:55 NAME bash[38838]: #13  0x55a5c9323e61 in ???
Mar 17 09:34:55 NAME bash[38838]: #14  0x55a5c9321eee in ???
Mar 17 09:34:55 NAME bash[38838]: #15  0x7ff3501b2d8f in ???
Mar 17 09:34:55 NAME bash[38838]: #16  0x7ff3501b2e3f in __libc_start_main
Mar 17 09:34:55 NAME bash[38838]: #17  0x55a5c9321f24 in ???
Mar 17 09:34:55 NAME bash[38838]: #18  0xffffffffffffffff in ???

Here is my Dockerfile:

# Final image
FROM ubuntu:23.04

ARG CP2K_VERSION=2023.1-2

# Install packages
RUN apt update && \
    apt install -y --no-install-recommends bc numactl \
        python3 python3-pip libglu1-mesa freeglut3-dev cp2k=${CP2K_VERSION} \
        openmpi-bin libopenblas0-openmp && \

This is how I run CP2K. The machine has 128 cores (2x AMD EPYC 7742) and I start one process per core, so on average one process crashes every 2-3 batches:

# Run NUM_PROCS single-core CP2K instances, each pinned to its own core
for i in $(seq 0 $(($NUM_PROCS-1)));
do
    numactl --physcpubind=$i cp2k.popt -i ${INPUT} -o out-$i.log &
done
wait
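
Because the loop above ends with a bare wait, the exit status of each background process is discarded. A minimal sketch of a variant that waits on each PID and reports which core index failed is shown here. The function names run_pinned and cp2k_on_core are hypothetical helpers introduced for illustration, not part of CP2K or of the original script.

```shell
# run_pinned N CMD [ARGS...]
# Start N background copies of CMD, appending the core index (0..N-1) as the
# last argument, then wait for each PID and report nonzero exit statuses.
run_pinned() {
    n=$1; shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" "$i" &
        pids="$pids $!"
        i=$((i + 1))
    done

    failed=0
    i=0
    for pid in $pids; do
        status=0
        wait "$pid" || status=$?
        if [ "$status" -ne 0 ]; then
            # A status above 128 usually means "killed by signal (status - 128)";
            # on Linux, 134 corresponds to SIGABRT (signal 6).
            echo "run $i: exit status $status"
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed of $n runs failed"
    [ "$failed" -eq 0 ]
}

# Hypothetical wrapper around the command line from the issue.
# Assumes INPUT is set in the environment, as in the original loop.
cp2k_on_core() {
    numactl --physcpubind="$1" cp2k.popt -i "${INPUT}" -o "out-$1.log"
}

# Usage: run_pinned "$NUM_PROCS" cp2k_on_core
```

Knowing the failing index makes it possible to keep only the matching out-$i.log and to check whether crashes cluster on particular cores.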

Here is the input file:

&FORCE_EVAL
  METHOD QS
  &DFT
    BASIS_SET_FILE_NAME GTH_BASIS_SETS
    POTENTIAL_FILE_NAME POTENTIAL
    &MGRID
      CUTOFF 280
      REL_CUTOFF 30
    &END MGRID
    &QS
      EPS_DEFAULT 1.0E-12
      WF_INTERPOLATION PS
      EXTRAPOLATION_ORDER 3
    &END QS
    &SCF
      SCF_GUESS ATOMIC
      &OT ON
        MINIMIZER DIIS
      &END OT
    # SCF_GUESS        RESTART
    # EPS_SCF      1.0E-7
      MAX_SCF      5
      &PRINT
        &RESTART OFF
        &END
      &END
    &END SCF
    &XC
      &XC_FUNCTIONAL Pade
      &END XC_FUNCTIONAL
    &END XC
  &END DFT
  &SUBSYS
    &CELL
      ABC 9.8528 9.8528 9.8528
    &END CELL
    # 32 H2O (TIP5P,1bar,300K) a = 9.8528
    &COORD
   O       2.280398       9.146539       5.088696
   O       1.251703       2.406261       7.769908
   O       1.596302       6.920128       0.656695
   O       2.957518       3.771868       1.877387
   O       0.228972       5.884026       6.532308
   O       9.023431       6.119654       0.092451
   O       7.256289       8.493641       5.772041
   O       5.090422       9.467016       0.743177
   O       6.330888       7.363471       3.747750
   O       7.763819       8.349367       9.279457
   O       8.280798       3.837153       5.799282
   O       8.878250       2.025797       1.664102
   O       9.160372       0.285100       6.871004
   O       4.962043       4.134437       0.173376
   O       2.802896       8.690383       2.435952
   O       9.123223       3.549232       8.876721
   O       1.453702       1.402538       2.358278
   O       6.536550       1.146790       7.609732
   O       2.766709       0.881503       9.544263
   O       0.856426       2.075964       5.010625
   O       6.386036       1.918950       0.242690
   O       2.733023       4.452756       5.850203
   O       4.600039       9.254314       6.575944
   O       3.665373       6.210561       3.158420
   O       3.371648       6.925594       7.476036
   O       5.287920       3.270653       6.155080
   O       5.225237       6.959594       9.582991
   O       0.846293       5.595877       3.820630
   O       9.785620       8.164617       3.657879
   O       8.509982       4.430362       2.679946
   O       1.337625       8.580920       8.272484
   O       8.054437       9.221335       1.991376
   H       1.762019       9.820429       5.528454
   H       3.095987       9.107088       5.588186
   H       0.554129       2.982634       8.082024
   H       1.771257       2.954779       7.182181
   H       2.112148       6.126321       0.798136
   H       1.776389       7.463264       1.424030
   H       3.754249       3.824017       1.349436
   H       3.010580       4.524142       2.466878
   H       0.939475       5.243834       6.571945
   H       0.515723       6.520548       5.877445
   H       9.852960       6.490366       0.393593
   H       8.556008       6.860063      -0.294256
   H       7.886607       7.941321       6.234506
   H       7.793855       9.141028       5.315813
   H       4.467366       9.971162       0.219851
   H       5.758685      10.102795       0.998994
   H       6.652693       7.917443       3.036562
   H       6.711966       7.743594       4.539279
   H       7.751955       8.745180      10.150905
   H       7.829208       9.092212       8.679343
   H       8.312540       3.218330       6.528858
   H       8.508855       4.680699       6.189990
   H       9.742249       1.704975       1.922581
   H       8.799060       2.876412       2.095861
   H       9.505360       1.161677       6.701213
   H       9.920117      -0.219794       7.161006
   H       4.749903       4.186003      -0.758595
   H       5.248010       5.018415       0.403676
   H       3.576065       9.078451       2.026264
   H       2.720238       9.146974       3.273164
   H       9.085561       4.493058       9.031660
   H       9.215391       3.166305       9.749133
   H       1.999705       2.060411       1.927796
   H       1.824184       0.564565       2.081195
   H       7.430334       0.849764       7.438978
   H       6.576029       1.537017       8.482885
   H       2.415851       1.576460       8.987338
   H       2.276957       0.099537       9.289499
   H       1.160987       1.818023       4.140602
   H       0.350256       2.874437       4.860741
   H       5.768804       2.638450       0.375264
   H       7.221823       2.257514       0.563730
   H       3.260797       5.243390       5.962382
   H       3.347848       3.732214       5.988196
   H       5.328688       9.073059       5.982269
   H       5.007063       9.672150       7.334875
   H       4.566850       6.413356       3.408312
   H       3.273115       7.061666       2.963521
   H       3.878372       7.435003       6.843607
   H       3.884673       6.966316       8.283117
   H       5.918240       3.116802       5.451335
   H       5.355924       2.495093       6.711958
   H       5.071858       7.687254      10.185667
   H       6.106394       7.112302       9.241707
   H       1.637363       5.184910       4.169264
   H       0.427645       4.908936       3.301903
   H       9.971698       7.227076       3.709104
   H      10.647901       8.579244       3.629806
   H       8.046808       5.126383       2.213838
   H       7.995317       4.290074       3.474723
   H       1.872601       7.864672       7.930401
   H       0.837635       8.186808       8.987268
   H       8.314696      10.115534       2.212519
   H       8.687134       8.667252       2.448452
    &END COORD
    &KIND H
      BASIS_SET TZV2P-GTH
      POTENTIAL GTH-PADE-q1
    &END KIND
    &KIND O
      BASIS_SET TZV2P-GTH
      POTENTIAL GTH-PADE-q6
    &END KIND
  &END SUBSYS
&END FORCE_EVAL
&GLOBAL
  PROJECT H2O-32
  RUN_TYPE MD
  PRINT_LEVEL LOW
  &TIMINGS
     THRESHOLD 0.000001
  &END
&END GLOBAL
&MOTION
  &MD
    ENSEMBLE NVE
    STEPS 1
    TIMESTEP 0.5
    TEMPERATURE 300.0
  &END MD
&END MOTION

Thanks for your help and time.

loikki, Mar 17 '23 09:03

The programs are being killed (Program received signal SIGABRT). So, perhaps they are exceeding the memory limit that you set for your containers?
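
One way to check this from inside the container is to read the memory limit of the current cgroup. The sketch below assumes the standard Linux cgroup mount point /sys/fs/cgroup; the helper name cgroup_mem_limit is hypothetical, and the files may be absent if the container runtime applies no memory limit at all.

```shell
# Print the memory limit of the current cgroup in bytes.
# Prints "max" when cgroup v2 reports no limit, and "unknown" when neither
# the cgroup v2 nor the cgroup v1 file is readable.
# The optional argument overrides the cgroup root (default: /sys/fs/cgroup).
cgroup_mem_limit() {
    root=${1:-/sys/fs/cgroup}
    if [ -r "$root/memory.max" ]; then
        # cgroup v2
        cat "$root/memory.max"
    elif [ -r "$root/memory/memory.limit_in_bytes" ]; then
        # cgroup v1
        cat "$root/memory/memory.limit_in_bytes"
    else
        echo unknown
    fi
}

echo "cgroup memory limit: $(cgroup_mem_limit)"
```

Comparing this value with the combined memory footprint of all concurrently running processes gives a rough indication of whether a limit could be reached.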

When using MPI with Docker it's also always a good idea to increase the shared memory via --shm-size=1g.
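
With Docker, the flag is passed to docker run, for example docker run --shm-size=1g followed by the image and command. To see how much shared memory a container actually has, the size of /dev/shm can be queried from inside it. The following is a minimal sketch; the helper name shm_size_kb is hypothetical, and whether /dev/shm is a separate mount depends on the container runtime.

```shell
# Print the total size, in KiB, of the filesystem that holds the given path
# (default: /dev/shm). Prints nothing if the path cannot be queried.
shm_size_kb() {
    # df -P forces single-line POSIX output; -k reports 1024-byte blocks.
    # Column 2 of the second line is the total size.
    df -Pk "${1:-/dev/shm}" 2>/dev/null | awk 'NR==2 {print $2}'
}

size=$(shm_size_kb)
if [ -n "$size" ]; then
    echo "/dev/shm: ${size} KiB"
else
    echo "/dev/shm: not available"
fi
```

A value of 65536 KiB corresponds to Docker's default of 64 MB.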

And generally, when running with many threads one usually has to raise the stack size via ulimit -s unlimited and export OMP_STACKSIZE=64m.
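
A minimal sketch of applying both settings in the launching shell, before any CP2K process is started, is given here. ulimit -s changes the stack limit of the shell and of the processes it starts, while OMP_STACKSIZE is read by the OpenMP runtime for its worker threads, so the second setting only matters if the binary is built with OpenMP. The fallback message is an addition for illustration.

```shell
# Raise the stack limit for this shell and its children.
# This can fail if the hard limit is lower, so report instead of aborting.
if ! ulimit -s unlimited 2>/dev/null; then
    echo "could not raise stack limit; current soft limit: $(ulimit -s)" >&2
fi

# Stack size for OpenMP worker threads (64 MiB), read by the OpenMP runtime.
export OMP_STACKSIZE=64m

echo "stack limit: $(ulimit -s), OMP_STACKSIZE=${OMP_STACKSIZE}"
```

These lines would go before the for loop that launches the processes, so that every background process inherits the settings.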

oschuett, Apr 14 '23 19:04

I checked a few things and it does not look like a thread issue or a memory limit. Anyway, I can do my benchmarks even with this bug. Feel free to close this issue.

Best,

loikki, Jun 28 '23 13:06