Kill VASP on reaching `MaxCorrectionsPerJobError`
we noticed that a lot of our nodes had zombie VASP processes that were hogging compute and memory without doing anything useful after hitting `MaxCorrectionsPerJobError`, preventing new jobs from starting properly (they were showing as running but not doing anything). it appears that's because `Custodian._run_job` had no logic to actually kill VASP when the default of 5 errors per job is reached. is this by design? i'm surprised this didn't come up before, so maybe we're missing something? fwiw, we're no longer seeing the zombie VASP processes after redeploying with the changes in this PR.
@Andrew-S-Rosen @esoteric-ephemera @shyuep
@janosh: Before proceeding with this PR, can you please report what version of Custodian you are using? I modified the termination behavior in https://github.com/materialsproject/custodian/pull/396. If you are using v2025.10.11, can you please check whether 2025.8.13 prevents the issue from occurring? I would like to confirm that it is unrelated before proceeding with this PR.
yes, using v2025.10.11. i looked at #396 before starting work on this, thinking it could be related. if it is, the connection at least isn't obvious
Looking at your PR, indeed, it does not seem like it would be related (although you never know). I remain just as surprised as you that this behavior would not have been caught before now. I will try to find time to test this in production before it is merged, but it will take me a little bit.
Codecov Report
:x: Patch coverage is 41.66667% with 21 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 52.89%. Comparing base (a305609) to head (fee3e62).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/custodian/custodian.py | 41.66% | 21 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #410 +/- ##
==========================================
- Coverage 53.02% 52.89% -0.14%
==========================================
Files 39 39
Lines 3508 3528 +20
==========================================
+ Hits 1860 1866 +6
- Misses 1648 1662 +14
:umbrella: View full report in Codecov by Sentry.
@janosh --- As a follow-up question, if the VASP job errors out, wouldn't it not be running anymore?
i was wondering under what circumstances Custodian might think VASP errored out but where the VASP process is actually still running. is that possible?
It would need to be a handler that has `is_monitor=True`, like `NonConvergingErrorHandler`, but in those cases it should be killing the active VASP job regardless of the maximum number of corrections.
Can you confirm whether the lingering VASP process was doing any I/O at all, or if it was completely stale? Maybe you can also share the errors in your custodian.json?
Personally, I haven't had this issue - jobs that hit that limit seem to get killed. Is it maybe some interplay between the job orchestration software and the SIGKILL Custodian sends?
I think there are instances where the VASP process is still running on an HPC node even though it has been killed on the other nodes. We have seen this before.
@janosh: Just a brief update from me. I just used Custodian 2025.10.11 and had a toy job hit the maximum number of errors. Everything aborted as expected, with no lingering VASP process. The relevant logging info is below:
custodian.custodian.MaxCorrectionsPerJobError: Max errors per job reached: 2.
So, this at least means I did not break anything in my earlier PR, which is what I was inquiring about. But it also means that there are mechanisms in place to properly abort without your proposed changes.
It would be very helpful to get the logging module stdout from the failed run you are referring to, if it happens again - there is some useful information in there for debugging purposes. Of course, if you are confident that this PR fixes your issue, please let me know. I ran with your PR and confirmed that it at least didn't drastically break anything on that same toy run.