
[Bug]: Custodian cannot be run safely when the executable is on a separate node

Open Andrew-S-Rosen opened this issue 4 months ago • 0 comments

What happened?

For full context of this issue, refer to the summary in https://github.com/materialsproject/custodian/pull/396.

Custodian may end up running on a master node while the VASP processes are launched on sister nodes. This is common, for instance, when requesting a single large Slurm allocation and running many concurrent VASP processes within it. Custodian currently cannot handle this setup: the Custodian process on the master node seemingly does not have permission to kill the VASP process on the other node(s) in the allocation, so it falls back to a killall command that kills everything, including perfectly healthy jobs. However, Custodian does have permission to kill the parent process that launches the VASP executable (typically an srun or mpirun call), which is in fact what the killall indiscriminately kills.
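To illustrate the idea of stopping only the launcher, here is a minimal sketch. It is not the code from #396 and not Custodian's API. The function names `terminate_launcher` and `demo` are hypothetical, and a sleeping Python child stands in for the `srun`/`mpirun` call. It assumes the launcher was started with `subprocess.Popen`, so the handle to that one process is available.

```python
import subprocess
import sys


def terminate_launcher(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Stop only the launcher process we started (hypothetical helper).

    Sends a termination request to this one process and escalates to a
    hard kill if it does not exit within `timeout` seconds. Nothing else
    on the node is touched, unlike a blanket `killall`.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
    return proc.returncode


def demo() -> int:
    # Stand-in for a launcher such as "srun ... <vasp executable>";
    # the real command line is an assumption and is not shown here.
    launcher = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    return terminate_launcher(launcher)
```

Whether the launcher then tears down the remote VASP ranks depends on the launcher itself (for example, how `srun` or `mpirun` forwards signals), so this sketch only covers the part that the Custodian process is permitted to do.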

#396 solves this for VASP, but essentially the same problem exists for the other codes. Once #396 is merged, the same fix should be easy to implement for them.

Version

2025.8.13

Which OS?

  • [ ] MacOS
  • [ ] Windows
  • [ ] Linux

Log output


Andrew-S-Rosen avatar Sep 04 '25 01:09 Andrew-S-Rosen