Axel Huebl
Added in https://github.com/AMReX-Codes/pyamrex/pull/370
cc @snicks11 @rezaplasma @fredericaliu @HLQzZ @hnakahara79 @baobaba13 @xiaowang119 @BifengLei @DanielWinklehner @tmiethlinger @HuntFeng @wphu @ppiot @kookjine @eebasso @fickas @shefalys @prochairbss @chandanthakur-phy @stannnnnnnnn @jcyu96 @wangjia-ai @bssharma1958 @jjvdwOX @zhuruihu @philmartin01 @shekhar4091 @Xinying-Wang-CS...
A simple way to monitor locally from the job script would be to use the `mtime` (i.e., the modification time shown by `ls -l`) of our `output.txt`, which is periodically updated with the progress status from `stdout`. Logic...
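As a rough sketch of the `mtime` idea: the helpers below compute how long ago a file was last written and compare that against a timeout. The names `file_age_sec` and `is_stalled` are made up for illustration, and the code assumes GNU coreutils `stat -c %Y` (typical on Linux HPC systems).

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of any existing template script.

# Print the number of seconds since FILE was last modified.
# Assumes GNU stat; returns 1 if the file cannot be read.
file_age_sec() {
    local file="$1" mtime
    mtime=$(stat -c %Y "$file" 2>/dev/null) || return 1
    echo $(( $(date +%s) - mtime ))
}

# Succeed (exit 0) if FILE was not updated for more than TIMEOUT_SEC seconds.
# A missing file yields exit code 2, which callers see as "not stalled".
is_stalled() {
    local file="$1" timeout_sec="$2" age
    age=$(file_age_sec "$file") || return 2
    [ "$age" -gt "$timeout_sec" ]
}
```

For example, `is_stalled output.txt 300` would succeed once `output.txt` has gone more than 5 min without an update.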
Note: we'll add this carefully to the HPC template scripts. We need to double-check whether this works well with the SBATCH signals, e.g., a signal sent 10 min before the walltime limit.
Ok, very good point. We should make two improvements:
- check if the PID still exists (though PIDs can be re-assigned)
- add `#SBATCH --kill-on-bad-exit=1`
- maybe add a `break` for...
I am thinking of these updates:
```bash
# ...
srun --kill-on-bad-exit=1 [...] > output_${SLURM_JOBID}.txt &
srun_pid=$!

timeout_sec=300  # timeout: 5min
while true
do
    if kill -0 "$srun_pid"
    then
        # signal delivered,...
```
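For reference, here is one way such a loop could be completed. This is only a sketch under assumptions, not the script from this thread: `watch_job` and its parameters are hypothetical, it assumes GNU `stat -c %Y`, and it only reports a stall, leaving the decision on how to terminate to the caller. The PID check uses `kill -0`, so the caveat about re-assigned PIDs still applies.

```bash
#!/usr/bin/env bash
# Hypothetical watchdog: watch_job PID LOGFILE TIMEOUT_SEC [POLL_SEC]
# Returns 0 once the process has exited on its own,
# or 1 if LOGFILE was not updated for more than TIMEOUT_SEC seconds.
watch_job() {
    local pid="$1" log="$2" timeout_sec="$3" poll_sec="${4:-10}"
    local mtime age
    # Signal 0 sends nothing; it only checks that the PID still exists.
    while kill -0 "$pid" 2>/dev/null
    do
        # Assumes GNU stat. A missing log file is treated as freshly written.
        mtime=$(stat -c %Y "$log" 2>/dev/null) || mtime=$(date +%s)
        age=$(( $(date +%s) - mtime ))
        if [ "$age" -gt "$timeout_sec" ]; then
            return 1   # stalled: the caller decides how to terminate
        fi
        sleep "$poll_sec"
    done
    return 0
}
```

In a job script this could be called as `watch_job "$srun_pid" "output_${SLURM_JOBID}.txt" "$timeout_sec"` right after `srun` is started in the background.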
Another improvement we can make: send a SIGTERM (`kill -15`) first, wait 5 min to let it write an AMReX backtrace, then send a SIGKILL (`kill -9`).
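A minimal sketch of this two-stage termination, assuming plain `kill` and a one-second polling interval. The function name `terminate_gracefully` is hypothetical, and how `srun` forwards the signal to the individual ranks would still need to be checked on the target system.

```bash
#!/usr/bin/env bash
# Hypothetical helper: terminate_gracefully PID [GRACE_SEC]
# Sends SIGTERM, waits up to GRACE_SEC seconds (default: 300, i.e. 5 min),
# then falls back to SIGKILL.
# Returns 0 if the process exited after SIGTERM (or was already gone),
# and 1 if SIGKILL was needed.
terminate_gracefully() {
    local pid="$1" grace_sec="${2:-300}" waited=0
    kill -15 "$pid" 2>/dev/null || return 0     # already gone
    while kill -0 "$pid" 2>/dev/null
    do
        if [ "$waited" -ge "$grace_sec" ]; then
            kill -9 "$pid" 2>/dev/null
            return 1                            # had to escalate to SIGKILL
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0                                    # exited within the grace period
}
```

With the `srun_pid` from the job script, `terminate_gracefully "$srun_pid" 300` would give the application 5 min to write its backtrace before the hard kill.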
@titoiride do you mind posting the latest version of this script again? :)
@WeiqunZhang @dpgrote et al., please feel free to push here as needed :pray:
@WeiqunZhang this is an interesting one: On NERSC PM it fails with: ``` MPICH ERROR [Rank 56] [job id 39480540.0] [Tue Jun 10 06:09:48 2025] [nid003912] - Abort(1660687) (rank 56...