Kill VASP on reaching `MaxCorrectionsPerJobError`
we noticed that a lot of our nodes had zombie VASP processes that were hogging compute and memory without doing anything useful after hitting `MaxCorrectionsPerJobError`, preventing new jobs from starting properly (they were showing as running but not doing anything). it appears that's because `Custodian._run_job` had no logic to actually kill VASP when the default of 5 errors per job is reached. is this by design? i'm surprised this didn't come up before, so maybe we're missing something? fwiw, we're no longer seeing the zombie VASP processes after redeploying with the changes in this PR.
@Andrew-S-Rosen @esoteric-ephemera @shyuep
@janosh: Before proceeding with this PR, can you please report what version of Custodian you are using? I modified the termination behavior in https://github.com/materialsproject/custodian/pull/396. If you are using v2025.10.11, can you please check whether 2025.8.13 prevents the issue from occurring? I would like to confirm that it is unrelated before proceeding with this PR.
yes, using v2025.10.11. i looked at #396 before starting work on this, thinking it could be related. if it is, the connection at least isn't obvious
Looking at your PR, indeed, it does not seem like it would be related (although you never know). I remain just as surprised as you that this behavior would not have been caught before now. I will try to find time to test this in production before it is merged, but it will take me a little bit.
Codecov Report
:x: Patch coverage is 41.66667% with 21 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 52.89%. Comparing base (a305609) to head (fee3e62).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/custodian/custodian.py | 41.66% | 21 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #410 +/- ##
==========================================
- Coverage 53.02% 52.89% -0.14%
==========================================
Files 39 39
Lines 3508 3528 +20
==========================================
+ Hits 1860 1866 +6
- Misses 1648 1662 +14
:umbrella: View full report in Codecov by Sentry.
@janosh --- As a follow-up question, if the VASP job errors out, wouldn't it not be running anymore?
i was wondering under what circumstances Custodian might think VASP errored out but where the VASP process is actually still running. is that possible?
It would need to be a handler that has `is_monitor=True`, like `NonConvergingErrorHandler`, but in those cases it should be killing the active VASP job regardless of the maximum number of corrections.
Can you confirm whether the lingering VASP process was doing any I/O at all, or if it was completely stale? Maybe you can also share the errors in your custodian.json?
Personally, I haven't had this issue - jobs that hit that limit seem to get killed. Is it maybe some interplay between the job orchestration software and the SIGKILL Custodian sends?
I think there are instances where the VASP process is still running on an HPC node even though it has been killed on the other nodes. We have seen this before.
@janosh: Just a brief update from me. I just used Custodian 2025.10.11 and had a toy job hit the maximum number of errors. Everything aborted as expected, with no lingering VASP process. The relevant logging info is below:
custodian.custodian.MaxCorrectionsPerJobError: Max errors per job reached: 2.
So, this at least means I did not break anything in my earlier PR, which is what I was inquiring about. But it also means that there are mechanisms in place to properly abort without your proposed changes.
It would be very helpful to get the logging module stdout from the failed run you are referring to, if it happens again - there is some useful information in there for debugging purposes. Of course, if you are confident that this PR fixes your issue, please let me know. I ran with your PR and confirmed that it at least didn't drastically break anything on that same toy run.