Random crash with Ubuntu package
Hi,
I am trying to use CP2K from apt-get (Ubuntu 22.04 and 23.04, package versions 2023.1-2 and 9.1-2) inside a container (Docker for building, Singularity for running). Unfortunately, I get some random crashes (roughly once every 200-300 runs). I have no experience with CP2K and use it only for some benchmarking, so I might be doing something stupid.
Here is the error:
Mar 17 09:34:55 NAME bash[38838]: Program received signal SIGABRT: Process abort signal.
Mar 17 09:34:55 NAME bash[38838]: Backtrace for this error:
Mar 17 09:34:55 NAME bash[38838]: #0 0x7ff350622ad0 in ???
Mar 17 09:34:55 NAME bash[38838]: #1 0x7ff350621c35 in ???
Mar 17 09:34:55 NAME bash[38838]: #2 0x7ff3501cb51f in ???
Mar 17 09:34:55 NAME bash[38838]: #3 0x7ff35021fa7c in pthread_kill
Mar 17 09:34:55 NAME bash[38838]: #4 0x7ff3501cb475 in raise
Mar 17 09:34:55 NAME bash[38838]: #5 0x7ff3501b17f2 in abort
Mar 17 09:34:55 NAME bash[38838]: #6 0x55a5cb392ef0 in ???
Mar 17 09:34:55 NAME bash[38838]: #7 0x55a5c93ecace in ???
Mar 17 09:34:55 NAME bash[38838]: #8 0x55a5c953dc01 in ???
Mar 17 09:34:55 NAME bash[38838]: #9 0x55a5c9427233 in ???
Mar 17 09:34:55 NAME bash[38838]: #10 0x55a5c9427cc3 in ???
Mar 17 09:34:55 NAME bash[38838]: #11 0x55a5c9326860 in ???
Mar 17 09:34:55 NAME bash[38838]: #12 0x55a5c9329744 in ???
Mar 17 09:34:55 NAME bash[38838]: #13 0x55a5c9323e61 in ???
Mar 17 09:34:55 NAME bash[38838]: #14 0x55a5c9321eee in ???
Mar 17 09:34:55 NAME bash[38838]: #15 0x7ff3501b2d8f in ???
Mar 17 09:34:55 NAME bash[38838]: #16 0x7ff3501b2e3f in __libc_start_main
Mar 17 09:34:55 NAME bash[38838]: #17 0x55a5c9321f24 in ???
Mar 17 09:34:55 NAME bash[38838]: #18 0xffffffffffffffff in ???
Here is my Dockerfile:
# Final image
FROM ubuntu:23.04
ARG CP2K_VERSION=2023.1-2
# Install packages
RUN apt update && \
    apt install -y --no-install-recommends bc numactl \
        python3 python3-pip libglu1-mesa freeglut3-dev cp2k=${CP2K_VERSION} \
        openmpi-bin libopenblas0-openmp
This is how I run CP2K. The machine has 128 cores (2x AMD EPYC 7742) and I launch one single-core run per core, so with 128 runs per batch a process crashes roughly every 2-3 batches:
# Run CP2K on 1 core NUM_PROCS time
for i in $(seq 0 $(($NUM_PROCS-1)));
do
numactl --physcpubind=$i cp2k.popt -i ${INPUT} -o out-$i.log &
done
wait
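For completeness, the container workflow is roughly the following (the image tag, file names, and the docker-daemon:// conversion step are placeholders for my actual setup): the image is built with Docker, converted to a Singularity image, and the loop above is run inside it.

# Build the Docker image, convert it to a Singularity image, run the batch script
docker build -t cp2k-bench .
singularity build cp2k.sif docker-daemon://cp2k-bench:latest
singularity exec cp2k.sif bash run.sh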
Here is the input file:
&FORCE_EVAL
METHOD QS
&DFT
BASIS_SET_FILE_NAME GTH_BASIS_SETS
POTENTIAL_FILE_NAME POTENTIAL
&MGRID
CUTOFF 280
REL_CUTOFF 30
&END MGRID
&QS
EPS_DEFAULT 1.0E-12
WF_INTERPOLATION PS
EXTRAPOLATION_ORDER 3
&END QS
&SCF
SCF_GUESS ATOMIC
&OT ON
MINIMIZER DIIS
&END OT
# SCF_GUESS RESTART
# EPS_SCF 1.0E-7
MAX_SCF 5
&PRINT
&RESTART OFF
&END
&END
&END SCF
&XC
&XC_FUNCTIONAL Pade
&END XC_FUNCTIONAL
&END XC
&END DFT
&SUBSYS
&CELL
ABC 9.8528 9.8528 9.8528
&END CELL
# 32 H2O (TIP5P,1bar,300K) a = 9.8528
&COORD
O 2.280398 9.146539 5.088696
O 1.251703 2.406261 7.769908
O 1.596302 6.920128 0.656695
O 2.957518 3.771868 1.877387
O 0.228972 5.884026 6.532308
O 9.023431 6.119654 0.092451
O 7.256289 8.493641 5.772041
O 5.090422 9.467016 0.743177
O 6.330888 7.363471 3.747750
O 7.763819 8.349367 9.279457
O 8.280798 3.837153 5.799282
O 8.878250 2.025797 1.664102
O 9.160372 0.285100 6.871004
O 4.962043 4.134437 0.173376
O 2.802896 8.690383 2.435952
O 9.123223 3.549232 8.876721
O 1.453702 1.402538 2.358278
O 6.536550 1.146790 7.609732
O 2.766709 0.881503 9.544263
O 0.856426 2.075964 5.010625
O 6.386036 1.918950 0.242690
O 2.733023 4.452756 5.850203
O 4.600039 9.254314 6.575944
O 3.665373 6.210561 3.158420
O 3.371648 6.925594 7.476036
O 5.287920 3.270653 6.155080
O 5.225237 6.959594 9.582991
O 0.846293 5.595877 3.820630
O 9.785620 8.164617 3.657879
O 8.509982 4.430362 2.679946
O 1.337625 8.580920 8.272484
O 8.054437 9.221335 1.991376
H 1.762019 9.820429 5.528454
H 3.095987 9.107088 5.588186
H 0.554129 2.982634 8.082024
H 1.771257 2.954779 7.182181
H 2.112148 6.126321 0.798136
H 1.776389 7.463264 1.424030
H 3.754249 3.824017 1.349436
H 3.010580 4.524142 2.466878
H 0.939475 5.243834 6.571945
H 0.515723 6.520548 5.877445
H 9.852960 6.490366 0.393593
H 8.556008 6.860063 -0.294256
H 7.886607 7.941321 6.234506
H 7.793855 9.141028 5.315813
H 4.467366 9.971162 0.219851
H 5.758685 10.102795 0.998994
H 6.652693 7.917443 3.036562
H 6.711966 7.743594 4.539279
H 7.751955 8.745180 10.150905
H 7.829208 9.092212 8.679343
H 8.312540 3.218330 6.528858
H 8.508855 4.680699 6.189990
H 9.742249 1.704975 1.922581
H 8.799060 2.876412 2.095861
H 9.505360 1.161677 6.701213
H 9.920117 -0.219794 7.161006
H 4.749903 4.186003 -0.758595
H 5.248010 5.018415 0.403676
H 3.576065 9.078451 2.026264
H 2.720238 9.146974 3.273164
H 9.085561 4.493058 9.031660
H 9.215391 3.166305 9.749133
H 1.999705 2.060411 1.927796
H 1.824184 0.564565 2.081195
H 7.430334 0.849764 7.438978
H 6.576029 1.537017 8.482885
H 2.415851 1.576460 8.987338
H 2.276957 0.099537 9.289499
H 1.160987 1.818023 4.140602
H 0.350256 2.874437 4.860741
H 5.768804 2.638450 0.375264
H 7.221823 2.257514 0.563730
H 3.260797 5.243390 5.962382
H 3.347848 3.732214 5.988196
H 5.328688 9.073059 5.982269
H 5.007063 9.672150 7.334875
H 4.566850 6.413356 3.408312
H 3.273115 7.061666 2.963521
H 3.878372 7.435003 6.843607
H 3.884673 6.966316 8.283117
H 5.918240 3.116802 5.451335
H 5.355924 2.495093 6.711958
H 5.071858 7.687254 10.185667
H 6.106394 7.112302 9.241707
H 1.637363 5.184910 4.169264
H 0.427645 4.908936 3.301903
H 9.971698 7.227076 3.709104
H 10.647901 8.579244 3.629806
H 8.046808 5.126383 2.213838
H 7.995317 4.290074 3.474723
H 1.872601 7.864672 7.930401
H 0.837635 8.186808 8.987268
H 8.314696 10.115534 2.212519
H 8.687134 8.667252 2.448452
&END COORD
&KIND H
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PADE-q1
&END KIND
&KIND O
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PADE-q6
&END KIND
&END SUBSYS
&END FORCE_EVAL
&GLOBAL
PROJECT H2O-32
RUN_TYPE MD
PRINT_LEVEL LOW
&TIMINGS
THRESHOLD 0.000001
&END
&END GLOBAL
&MOTION
&MD
ENSEMBLE NVE
STEPS 1
TIMESTEP 0.5
TEMPERATURE 300.0
&END MD
&END MOTION
Thanks for your help and time.
The programs are being killed (Program received signal SIGABRT). So, perhaps they are exceeding the memory limit that you set for your containers?
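One way to check that is to look at the kernel log on the host and at the cgroup memory limit; the commands below are only a sketch and assume cgroup v2:

# Did the OOM killer kill anything around the time of the crash?
dmesg -T | grep -i -E "killed process|out of memory"
# Memory limit of the current cgroup ("max" means unlimited)
cat /sys/fs/cgroup/memory.max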
When using MPI with Docker it's also always a good idea to increase the shared memory via --shm-size=1g.
And generally, when running with many threads one usually has to raise the stack size via ulimit -s unlimited and export OMP_STACKSIZE=64m.
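For reference, a minimal sketch of how these settings could be applied (the values are just the examples above, not tuned numbers):

# In the shell that starts CP2K (inside the container)
ulimit -s unlimited
export OMP_STACKSIZE=64m

# When starting the container with Docker, increase shared memory, e.g.:
# docker run --shm-size=1g ...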
I checked a few things and it does not look like a threading issue or a memory limit. Anyway, I can still run my benchmarks despite this bug. Feel free to close this issue.
Best,