lammps-USER-CONP2
Segmentation Error
I'm getting a segmentation fault when running a conp simulation for 8M steps, but no error occurs when running for only a few thousand steps. Is this normal? I'm using 200 GB of memory in total, with 5 GB per CPU. The stack trace shows the following:
[a247:1007389:0:1007389] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xd1)
==== backtrace (tid:1007389) ====
0 0x0000000000012b30 .annobin_sigaction.c() sigaction.c:0
1 0x00000000009a8b87 LAMMPS_NS::FixConp::alist_coul_cal() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/USER-CONP2/fix_conp.cpp:1250
2 0x00000000009a8202 LAMMPS_NS::FixConp::a_cal() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/USER-CONP2/fix_conp.cpp:814
3 0x00000000009a7a12 LAMMPS_NS::FixConp::linalg_setup() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/USER-CONP2/fix_conp.cpp:440
4 0x00000000009a7a12 LAMMPS_NS::FixConp::setup_pre_force() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/USER-CONP2/fix_conp.cpp:389
5 0x00000000004d2aec LAMMPS_NS::Modify::setup_pre_force() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/modify.cpp:369
6 0x00000000005e98b8 LAMMPS_NS::Verlet::setup() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/verlet.cpp:134
7 0x0000000000571806 LAMMPS_NS::Run::command() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/run.cpp:175
8 0x000000000044b050 LAMMPS_NS::Input::execute_command() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/input.cpp:795
9 0x0000000000448b0c LAMMPS_NS::Input::file() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/input.cpp:273
10 0x000000000041a2dd main() /anvil/projects/xxx/lammps4/lammps-patch_27May2021/src/main.cpp:93
11 0x00000000000234a3 __libc_start_main() ???:0
12 0x000000000041a1ae _start() ???:0
=================================
srun: error: a247: task 24: Segmentation fault (core dumped)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 629321.0 ON a247 CANCELLED AT 2022-10-11T02:14:24 ***
slurmstepd: error: *** JOB 629321 ON a247 CANCELLED AT 2022-10-11T02:14:24 ***
I have had successful runs of millions of time steps, but those systems might be smaller than yours. The stack trace you attached doesn't seem to lead to a useful line in the current version of the code -- can you show me what your source code has at that line in your installation? (Or the latest git commit, or some other information about your particular version of this code.)
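One way to pull out the lines in question is a small script like the one below. It is only a sketch: the helper name `source_context` and the `radius` parameter are made up for illustration, and the file path and line number in the usage comment are taken from the stack trace above, on the assumption that you run it from the directory holding your copy of `fix_conp.cpp`.

```python
from pathlib import Path


def source_context(path, line_no, radius=3):
    """Return the lines around `line_no` (1-indexed) of a source file.

    Hypothetical helper, not part of LAMMPS or USER-CONP2. The target
    line is marked with '>>' so it is easy to spot when pasted.
    """
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"{path} has {len(lines)} lines, no line {line_no}")
    start = max(1, line_no - radius)
    end = min(len(lines), line_no + radius)
    out = []
    for n in range(start, end + 1):
        marker = ">>" if n == line_no else "  "
        out.append(f"{marker} {n:5d}: {lines[n - 1]}")
    return out


# Example usage (frame 1 of the backtrace points at fix_conp.cpp:1250):
#   print("\n".join(source_context("fix_conp.cpp", 1250)))
```

Pasting the output for the first few frames (lines 1250, 814, 440 and 389) would show whether the installed source matches the current repository.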
The code around the run setup portion has always been pretty messy, so I wouldn't be surprised if there are memory bugs or other kinds of bugs still lurking there. For now I can only advise you to write restart files fairly often. Are you issuing several short run commands, or one long run? If long runs keep giving you consistent segfaults, we'll have more information to proceed.
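As a rough sketch of the "restart files often" advice, the script below generates LAMMPS input lines that split one long run into shorter `run` commands with a `write_restart` after each. The function name, the chunk size of 1M steps and the `conp.restart` file prefix are assumptions for illustration only, and the number in each restart file name is the step count completed by this script, not necessarily the absolute timestep.

```python
def chunked_run_commands(total_steps, chunk_steps, restart_prefix="conp.restart"):
    """Build LAMMPS input lines for a long run split into shorter chunks.

    Hypothetical helper: emits one `run` command per chunk, each followed
    by a `write_restart`, so a crash costs at most one chunk of work.
    """
    if total_steps <= 0 or chunk_steps <= 0:
        raise ValueError("total_steps and chunk_steps must be positive")
    lines = []
    done = 0
    while done < total_steps:
        steps = min(chunk_steps, total_steps - done)  # last chunk may be shorter
        done += steps
        lines.append(f"run {steps}")
        lines.append(f"write_restart {restart_prefix}.{done}")
    return lines


if __name__ == "__main__":
    # 8M steps as in the original post; the 1M chunk size is an arbitrary choice.
    print("\n".join(chunked_run_commands(8_000_000, 1_000_000)))
```

Note that each `run` command goes through the run setup again, which is where the backtrace above ends up, so a chunked input would also show whether the crash is tied to setup or to the length of a single run.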
Below is the link to the version I'm using, so the line numbers shown in the stack trace match my source exactly: https://github.com/srtee/lammps-USER-CONP2/blob/main/fix_conp.cpp
I'm not using the "matout" keyword, as I don't need the matrix calculation. Another question, which might not be relevant to this thread: does this conp fix produce exactly the same output as the older conp fix (zhenxingwang)? I ask because the older version always calculated the matrix and its inverse, whereas in your version the matrix calculation is optional.
Yeah, that's a very weird place for a segfault. Unless we can find out more about why this is happening, I won't be able to figure out the error.
In general, this version will not produce the same results as the old CONP code. The old code was very limited, and because it did not guarantee electroneutrality, its results would not be fully trustworthy.