proposal: error code instead of core dump when max_allowed_nz is met
Edit: version mesa-r23.05.1
Per discussion with Yaguang Li @parallelpro:
when the number of mesh points (nz) exceeds max_allowed_nz at its default value of 8000,
! mesh adjustment
! ===============
! max_allowed_nz
! ~~~~~~~~~~~~~~
! Maximum number of grid points allowed.
! ::
max_allowed_nz = 8000
the response from MESA is a scary-looking core dump.
Can we instead offer a termination code when nz reaches max_allowed_nz?
Interesting. There should be a message written to the terminal when max_allowed_nz is exceeded:
https://github.com/MESAHub/mesa/blob/a965ec699b2b4978c43b0b305f40446b1aff05f0/star/private/mesh_plan.f90#L414-L417
and
https://github.com/MESAHub/mesa/blob/a965ec699b2b4978c43b0b305f40446b1aff05f0/star/private/mesh_plan.f90#L871-L874
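For context, those checks amount to a guard of roughly this shape (a schematic sketch only, not the verbatim MESA source; the routine and variable names here are illustrative):

! schematic of the guard in mesh_plan (names illustrative, not verbatim MESA source)
subroutine check_nz_limit(new_nz, max_allowed_nz, ierr)
   integer, intent(in) :: new_nz, max_allowed_nz
   integer, intent(out) :: ierr
   ierr = 0
   if (new_nz > max_allowed_nz) then
      write(*,*) 'tried to increase number of mesh points beyond max allowed nz', max_allowed_nz
      ierr = -1   ! the caller then reports adjust_mesh_failed as the termination code
   end if
end subroutine check_nz_limit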
In a brief test intentionally crashing a test_suite model, I get:
"tried to increase number of mesh points beyond max allowed nz 1000
mesh_plan problem
doing mesh_call_number 2009
s% model_number 2400
terminated evolution: adjust_mesh_failed
termination code: adjust_mesh_failed"
This is followed by a backtrace. I'm surprised it isn't showing up in your attachment?
This is what I get just by setting max_dq to a very small value in the basic star work directory:
tried to increase number of mesh points beyond max allowed nz 8000
mesh_plan problem
doing mesh_call_number 1
s% model_number 1
terminated evolution: adjust_mesh_failed
termination code: adjust_mesh_failed
double free or corruption (!prev)
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x72666e05a76f in ???
#1 0x72666e0ab32c in ???
#2 0x72666e05a6c7 in ???
#3 0x72666e0424b7 in ???
#4 0x72666e043394 in ???
#5 0x72666e0b52a6 in ???
#6 0x72666e0b737b in ???
#7 0x72666e0b7668 in ???
#8 0x72666e0b9e92 in ???
#9 0x4b96dc in __alloc_MOD_do2d
at ../private/alloc.f90:1816
#10 0x4b9dff in do2
at ../private/alloc.f90:1574
#11 0x4bb687 in __alloc_MOD_star_info_arrays
at ../private/alloc.f90:512
#12 0x4c21a0 in __alloc_MOD_free_arrays
at ../private/alloc.f90:281
#13 0x4c2234 in __alloc_MOD_free_star_data
at ../private/alloc.f90:235
#14 0x40f7c2 in __star_lib_MOD_free_star
at ../public/star_lib.f90:113
#15 0x42202a in __run_star_support_MOD_after_evolve_loop
at ../job/run_star_support.f90:904
#16 0x426616 in __run_star_support_MOD_run1_star
at ../job/run_star_support.f90:123
#17 0x40753c in __run_star_MOD_do_run_star
at /home/pablom/work/mesa_versions/mesa-r23.05.1/star/job/run_star.f90:26
#18 0x4075dc in run
at ../src/run.f90:16
#19 0x40761e in main
at ../src/run.f90:2
./rn: line 6: 809122 Aborted (core dumped) ./star
DATE: 2024-02-15
TIME: 16:23:34
OK, thanks. It's possible that the useful termination condition message was redirected elsewhere and we missed it.
I won't have time to look into this for >3 weeks, but I think @mjoyceGR still has a point: even if MESA detects the error and prints a message, why do we still hit the core dump and backtrace? My memory might be getting rusty while I'm not running MESA so much, but I thought we usually exited relatively gracefully from known termination conditions.
I think @parallelpro is going to come here and post more about his data output configuration in a day or two to confirm the error message is truly missing, but a related question in the meantime:
Is there a way to set max_allowed_nz arbitrarily high? max_allowed_nz = -1 does not work (confirmed on r23.05.1), even though setting -1 does permit arbitrarily high upper limits for many other controls. I realize there may be a legitimate reason to forbid this for nz, given that it sets the size of the most important arrays.
Thank you all for the testing! Let me explain my setup a bit. I set up a Python script called driver.py to modify an inlist template; it contains the following snippet to initiate the MESA run and redirect the terminal output to a log file.
import os  # needed for os.system; `index` is set earlier in driver.py

print('------ MESA start ------')
# run the work directory's rn script; '>' captures stdout only, stderr ends up elsewhere (here, the Slurm .err file)
os.system('sh rn > mesa_terminal_output_index{:06.0f}.txt'.format(index))
print('------ MESA done ------')
The last few lines of this mesa_terminal_output_index000000.txt log file are shown below (it looks truncated; does that give a clue?):
7320 7.789895 3655.816 2.955703 2.955706 1.000000 0.612600 0.000000 0.006532 0.272021 20.559330 7837 0
3.0224E+00 7.771256 1.876429 -5.649074 1.789376 -99.000000 0.387400 0.986633 0.001747 0.013447 0.841667 6
1.1999E+10 5.773430 2.959489 -2.230556 -18.794754 -8.017279 0.000000 0.000038 0.001396 0.013367 0.000E+00 varcontrol
save LOGS/profile3656.data LOGS/profile3656.data.FGONG for model 7320
save LOGS/profile3657.data LOGS/profile3657.data.FGONG for model 7322
save LOGS/profile3658.data LOGS/profile3658.data.FGONG for model 7324
save LOGS/profile3659.data LOGS/profile3659.data.FGONG for model 7326
save LOGS/profile3660.data LOGS/profile3660.data.FGONG for model 7328
7330 7.790135 3655.106 2.956606 2.956609 1.000000 0.612467 0.000000 0.006532 0.272021 20.560022 7844 0
DATE: 2024-02-07
TIME: 08:48:33
I submitted the following job to an HPC cluster that uses the Slurm workload management system.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=shared
#SBATCH --time=02-00:00:00 ## time format is DD-HH:MM:SS
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G ## max amount of memory per node you require
#SBATCH --error=test-%A_%a.err ## %A - filled with jobid
#SBATCH --output=test-%A_%a.out ## %A - filled with jobid
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE,TIME_LIMIT_80
#SBATCH [email protected]
## All options and environment variables found on schedMD site: http://slurm.schedmd.com/sbatch.html
# record time
date
hostname
# change to zsh
# module purge
source /home/yaguangl/custom_setup.sh
source /home/yaguangl/.zshrc
# navigate to the mesa directory
cd template_sun_0.2/
# activate astro
micromamba activate astro
sh clean
sh mk
python driver.py 0
date
Upon completion, I received the following log file test-1110897_4294967294.out from Slurm:
Wed Feb 7 00:23:17 UTC 2024
cn-02-03-06
gfortran -Wno-uninitialized -fno-range-check -fmax-errors=7 -fprotect-parens -fno-sign-zero -fbacktrace -ggdb -finit-real=snan -fopenmp -fbounds-check -Wuninitialized -Warray-bounds -ggdb -ffree-form -ffree-line-length-none -x f95-cpp-input -std=f2008 -Wno-error=tabs -I/home/yaguangl/mesa-r23.05.1/include -I../src -c ../src/run_star_extras.f90
gfortran -Wno-uninitialized -fno-range-check -fmax-errors=7 -fprotect-parens -fno-sign-zero -fbacktrace -ggdb -finit-real=snan -fopenmp -fbounds-check -Wuninitialized -Warray-bounds -ggdb -ffree-form -ffree-line-length-none -x f95-cpp-input -std=f2008 -Wno-error=tabs -I/home/yaguangl/mesa-r23.05.1/include -I../src -c /home/yaguangl/mesa-r23.05.1/star/job/run_star.f90
gfortran -Wno-uninitialized -fno-range-check -fmax-errors=7 -fprotect-parens -fno-sign-zero -fbacktrace -ggdb -finit-real=snan -fopenmp -fbounds-check -Wuninitialized -Warray-bounds -ggdb -ffree-form -ffree-line-length-none -x f95-cpp-input -std=f2008 -Wno-error=tabs -I/home/yaguangl/mesa-r23.05.1/include -I../src -c ../src/run.f90
gfortran -fopenmp -o ../star run_star_extras.o run_star.o run.o -L/home/yaguangl/mesa-r23.05.1/lib -lstar -lgyre -latm -lcolors -lturb -lstar_data -lnet -leos -lkap -lrates -lneu -lchem -linterp_2d -linterp_1d -lnum -lauto_diff -lhdf5io -lmtx -lconst -lmath -lutils `mesasdk_crmath_link` `mesasdk_lapack95_link` `mesasdk_lapack_link` `mesasdk_blas_link` `mesasdk_hdf5_link` `mesasdk_pgplot_link` -lz -lgyre
Now calculating index000000m1.000a1.9Y0.25Z0.013.mod
------ MESA start ------
------ MESA done ------
Wed Feb 7 08:48:35 UTC 2024
and the following error output test-1110897_4294967294.err:
/home/yaguangl/mesasdk/bin/../lib/gcc/x86_64-pc-linux-gnu/13.1.0/../../../../x86_64-pc-linux-gnu/bin/ld: warning: atm_support.o: requires executable stack (because the .note.GNU-stack section is executable)
double free or corruption (!prev)
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x1552585447f2 in ???
#1 0x155258543985 in ???
#2 0x1552581a7daf in ???
#3 0x1552581f454c in ???
#4 0x1552581a7d05 in ???
#5 0x15525817b7f2 in ???
#6 0x15525817c12f in ???
#7 0x1552581fe616 in ???
#8 0x15525820030b in ???
#9 0x155258202954 in ???
#10 0x4b99bc in __alloc_MOD_do2d
at ../private/alloc.f90:1816
#11 0x4ba11f in do2
at ../private/alloc.f90:1559
#12 0x4bba17 in __alloc_MOD_star_info_arrays
at ../private/alloc.f90:512
#13 0x4c2510 in __alloc_MOD_free_arrays
at ../private/alloc.f90:281
#14 0x4c25a4 in __alloc_MOD_free_star_data
at ../private/alloc.f90:235
#15 0x40f9fd in __star_lib_MOD_free_star
at ../public/star_lib.f90:788
#16 0x4223ca in __run_star_support_MOD_after_evolve_loop
at ../job/run_star_support.f90:903
#17 0x4269a6 in __run_star_support_MOD_run1_star
at ../job/run_star_support.f90:66
#18 0x407906 in __run_star_MOD_do_run_star
at /home/yaguangl/mesa-r23.05.1/star/job/run_star.f90:26
#19 0x40799e in run
at ../src/run.f90:16
#20 0x4079e0 in main
at ../src/run.f90:2
rn: line 6: 3018003 Aborted (core dumped) ./star
In all cases, the message about the mesh points that @orlox sees does not appear on my system. Any ideas? Thanks again for looking into this!
I just added max_dq = 1d-10 to a standard star/work folder with the latest development version of MESA and SDK 23.7.3 on Linux (Fedora 39) and didn't get the backtrace. I do get the backtrace if I also start from a main-sequence model by commenting out the lines with create_pre_main_sequence_model, Lnuc_div_L_zams_limit and stop_near_zams. I'll use this to start investigating when I have another chance to have a look.
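For reference, a minimal sketch of that reproduction in the standard star/work inlists (only the relevant controls are shown; the values of the commented-out lines are whatever inlist_project ships with, elided here):

&star_job
   ! commented out, so the run starts from a main-sequence model instead of the pre-MS
   ! create_pre_main_sequence_model = .true.
/ ! end of star_job namelist

&controls
   ! commented out, so the run does not stop near ZAMS
   ! Lnuc_div_L_zams_limit = ...
   ! stop_near_zams = .true.

   max_dq = 1d-10   ! absurdly small, forces nz past max_allowed_nz during remeshing
/ ! end of controls namelist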
Regarding the message not showing up for @parallelpro, that sounds like there might be some buffering in the output stream and MESA crashes before everything is written out.
A simple workaround, incidentally, is to just set max_allowed_nz to some crazy large value, like 100000. I don't mind unleashing it completely with e.g. -1, but I'm not sure how a system will behave if MESA tries to create too large a mesh. My hunch is that the OS will just kill the process, which is fine, but I'd like to rule out bringing anyone's system to an unusable halt first.
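For concreteness, that workaround is a one-line change in the &controls namelist (100000 is just the example value mentioned above, not a recommendation):

&controls
   max_allowed_nz = 100000   ! raise the cap well above the default of 8000
/ ! end of controls namelist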
I've opened PR #630 to try to fix the core dump. I'll investigate unleashing max_allowed_nz when I next have some MESA time.
I experimented very briefly with absurdly large values of max_allowed_nz and I don't think there's a fundamental reason that it has to be limited. The OS should kill the job if it takes too much memory, but with the basic net, MESA didn't fundamentally object to briefly creating a model with over a million mesh points, and it didn't crash the computer I tried it on.
MESA did crash during the remeshing, however, for some other reason that produced the same segfault as in this issue. There's still some pointer that we try to free before it's allocated if the remeshing fails, but the one raised in this issue should now be a simple stop rather than a segfault, following my changes in #630. If @parallelpro can confirm that the fix I've made (which can be backported) turns the current segfault into an error, this can be closed. (MESA should still crash if you don't also increase max_allowed_nz.)
Looking further ahead, I propose that:

1. We implement max_allowed_nz = -1 as an option that means the size of the mesh is unlimited (sketched below).
2. We retain a finite default, though 8000 is probably too low. (IMO the question is: how big a mesh indicates that something is going wrong in a calculation?)
I'll have a look at implementing -1 when I next get some MESA time™ and will start discussing point 2 among the devs.
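A minimal sketch of what the -1-means-unlimited option could look like in the guard (schematic only, not the actual MESA source; names are illustrative):

! schematic: treat a non-positive max_allowed_nz as "no limit" (names illustrative)
logical function nz_exceeds_limit(new_nz, max_allowed_nz)
   integer, intent(in) :: new_nz, max_allowed_nz
   if (max_allowed_nz <= 0) then
      nz_exceeds_limit = .false.   ! max_allowed_nz = -1 switches the check off
   else
      nz_exceeds_limit = (new_nz > max_allowed_nz)
   end if
end function nz_exceeds_limit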
@parallelpro What choice of mesh parameters is leading to more than 8000 zones, and in what stage of evolution?
Thanks @warrickball. The configuration that leads to more than 8000 zones is setting mesh_delta_coeff = 0.1 for a resolution study. For a typical 1 Msun track, the subgiant phase can easily exceed 8000 zones. I can test your MESA fix in the next few days.
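In other words, the resolution study just adds the following to &controls (a sketch), which is enough to push a typical 1 Msun subgiant past the default cap of 8000 zones, so max_allowed_nz also needs raising as in the workaround above:

&controls
   mesh_delta_coeff = 0.1   ! smaller values give finer meshes; 0.1 gives roughly 10x the zones of the default 1.0
/ ! end of controls namelist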
The GitHub branch tied to this issue has been successfully merged. Is it alright if I close this issue?
Yup, cheers