Fix: Avoid mutating `JobsList._polling_jobs` inside slurm scheduler
The joy of working with mutable types and passing by reference 🥲
Pair-coded with @khsrali 🫶
Codecov Report
:x: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 79.60%. Comparing base (d94a472) to head (e22f342).
:warning: Report is 1 commit behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aiida/engine/processes/calcjobs/manager.py | 80.00% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #7155 +/- ##
==========================================
- Coverage 79.60% 79.60% -0.00%
==========================================
Files 566 566
Lines 43587 43585 -2
==========================================
- Hits 34694 34692 -2
Misses 8893 8893
:umbrella: View full report in Codecov by Sentry.
Thanks for the review, @danielhollas, all very good points! I applied your suggestions and will add you to the squash-merge message 🫶
Ooof, good catch. I was incidentally recently looking at this code since I was adding typing and did not see the issue. 🙈
Well, the problem actually originates elsewhere, so there was no way to know what's wrong just from looking at manager.py 🥲
Can I ask what the original bug was and how you verified that this fixes it? It would be good to have an issue written up. (It would also be great to have a test, but I'm not sure if we have existing tests that could be improved or if it would be a lot of work to test this.)
Actually, I found the bug in the SwissTwins project, whose tech stack is quite extensive (and fragile and hacky 😅), so it is really difficult to reproduce. When developing workflows there, I was getting this warning in the report: https://github.com/aiidateam/aiida-core/blob/6ee16094e0c52ef423670eaf216b1fa643f51dff/src/aiida/engine/processes/calcjobs/manager.py?plain=1#L156-L159 and my workflow crashed further down the line. It made sense that something fishy was going on, as that code path should be unreachable.
So, @khsrali and I started digging into this and found that `self._polling_jobs` contained the same `job_id` twice. When iterating over it and popping from a dictionary keyed by `job_id` inside the loop, this leads to the warning, because the same key ends up being popped from the dictionary twice.
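To make the mechanism concrete, here is a minimal, self-contained sketch of the aliasing problem; the names are made up for illustration and this is not the actual aiida-core code:

```python
def get_jobs(jobs):
    """Toy scheduler query that (wrongly) extends the caller's own list."""
    jobs.append(jobs[0])  # the mutation is visible to the caller: same object!
    return {job_id: f'JobInfo({job_id})' for job_id in jobs}


polling_jobs = ['503905']          # the manager's own bookkeeping list
futures = {'503905': 'future F1'}  # one pending status request per job id

jobs_cache = get_jobs(polling_jobs)
print(polling_jobs)                # ['503905', '503905'] <- mutated by reference

for job_id in polling_jobs:
    if job_id not in futures:
        # The second pass over the duplicated id ends up here: the key was
        # already popped, which corresponds to the "should be unreachable"
        # warning path in manager.py.
        print(f'warning: received job update for unrequested job {job_id}')
        continue
    futures.pop(job_id)            # resolve and drop the pending future
```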
Now, the typing issue in scheduler.py and the fact that slurm.py actually mutates its input were already there before, but did not cause problems. With the changes from #7601, the problems started to surface, though. I don't think it makes sense to add a regression test in this case, because:
- it would test old, buggy code,
- with the typing change, this should now be prevented (see the sketch below),
- we still have the warning on the unreachable path,
- and, apart from that, the current test coverage should be sufficient.
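For illustration, here is a minimal sketch of what such a typing-level guard can look like; the signature and body are hypothetical and do not mirror the actual scheduler.py/slurm.py code:

```python
from __future__ import annotations

from collections.abc import Sequence


def get_jobs(jobs: Sequence[str] | None = None) -> dict[str, None]:
    """Toy scheduler query, illustrative only.

    Annotating ``jobs`` as ``Sequence`` (which has no ``append``) lets mypy
    flag any accidental in-place mutation of the caller's list.
    """
    # Work on a private copy, so any list surgery stays local to this call.
    joblist = list(jobs) if jobs is not None else []
    if joblist:
        joblist.append(joblist[0])  # purely illustrative mutation of the copy
    return {job_id: None for job_id in joblist}
```

With this shape, the caller's `_polling_jobs` cannot be changed behind its back, regardless of what the plugin does internally.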
Actually, I'm now wondering if this is generally causing issues for everyone on v2.7.2, so maybe @mbercx has already done some field testing and can report back?
> Actually, I found the bug in the SwissTwins project, whose tech stack is quite extensive (and fragile and hacky 😅), so it is really difficult to reproduce. When developing workflows there, I was getting this warning in the report:
Oh, nice, I am glad we added that warning there! :-)
> and my workflow crashed further down the line. It made sense that something fishy was going on, as that code path should be unreachable.
Hmm, but I don't think the bug fixed in this PR should have caused the crash, though, since the code as written should just skip over the duplicate job id, right?
If that is not the case, then we'd indeed probably need to publish this in 2.7.3. I guess people would report this.
> Hmm, but I don't think the bug fixed in this PR should have caused the crash, though, since the code as written should just skip over the duplicate job id, right?
Actually, yes, good point. We had indeed already seen this appearing in my other project in the past, but since we deemed it a non-critical failure and the workflows were running through successfully, we didn't want to delay the v2.7.2 release further. However, I'm now running more complex workflows in the project (nested workgraphs with async monitor functions that retrieve the `job_id` and remote workdir, and SLURM pre-submission), and there this bug does seem to cause problems downstream (~~which, and how, I don't fully understand yet~~ EDIT: see below). I do think that for "normal" workflows it might actually be OK. As we plan v2.8.0 for mid/end of January anyway, it could be OK to include it there (but probably still best to also apply this fix to the support/2.7.x branch, I guess). What do you think?
**Possible failure mechanism in my SwissTwins workflow**

Key insight: with duplicates, the second iteration resolves a future to `None` when the job info is no longer in the cache.
**Scenario with the mutation bug:**

- **T=0**: Job 503905 submits, `job_id` set on the node
- **T=1**: CalcJob requests a status update → future F1 created
- **T=2**: Polling cycle starts
  - `_polling_jobs = ['503905']` (ONE job)
  - Pass to scheduler: `scheduler.get_jobs(jobs=_polling_jobs)` ← SAME list (reference!)
  - SLURM mutates it: `_polling_jobs` → `['503905', '503905']`
  - Returns: `_jobs_cache = {'503905': JobInfo(RUNNING)}`
- **T=3**: Loop iteration 1 (`job_id='503905'`)
  - `future = pop('503905')` → gets F1, REMOVES it from the dict
  - `F1.set_result(JobInfo(RUNNING))` → F1 resolves ✓
- **T=4**: Job 503905 completes VERY QUICKLY (5 seconds)
  - SLURM marks the job as DONE
  - The job gets PURGED from the scheduler (no longer in the queue)
- **T=5**: CalcJob receives the F1 result and processes it
  - Calls `request_job_info_update('503905')` AGAIN (retry/next poll)
  - Creates a NEW future F2 for `'503905'`
- **T=6**: Loop iteration 2 (`job_id='503905'`, the DUPLICATE!)
  - `future = pop('503905')` → gets F2!
  - `_jobs_cache.get('503905', None)` → returns `None`! (the job was purged!)
  - `F2.set_result(None)` → F2 resolves to `None`!
- **T=7**: CalcJob receives F2 = `None`
  - tasks.py lines 198-200: "If job_info is None... assume it's done"
  - Sets `node.set_scheduler_state(JobState.DONE)`
  - BUT: state corruption occurs, the `job_id` might not be fully processed
- **T=8**: `get_job_data` polling
  - Calls `node.get_job_id()` → returns `None` (state corrupted!)
  - Retries every 10 seconds
  - After 3600 seconds → TIMEOUT!
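To make the hypothesised T=2 to T=7 steps concrete, here is a toy re-enactment; the variable names and timing are simplified stand-ins, not the real manager.py objects:

```python
import asyncio


async def main():
    loop = asyncio.get_running_loop()

    polling_jobs = ['503905', '503905']          # duplicated by the mutation bug
    futures = {'503905': loop.create_future()}   # F1, requested by the CalcJob
    jobs_cache = {'503905': 'JobInfo(RUNNING)'}  # one squeue snapshot

    for iteration, job_id in enumerate(polling_jobs):
        future = futures.pop(job_id, None)
        if future is not None:
            # 1st iteration: F1 <- 'JobInfo(RUNNING)'; 2nd iteration: F2 <- None
            future.set_result(jobs_cache.get(job_id))
            print(f'iteration {iteration}: resolved future with {future.result()!r}')
        if iteration == 0:
            # In between, the CalcJob consumes F1 and immediately requests the
            # next update (creating F2), while the quickly finished job drops
            # out of the scheduler and hence out of the next cache lookup.
            futures['503905'] = loop.create_future()
            jobs_cache.pop('503905', None)

    # The second future resolved to None is what downstream code interprets
    # as "the job is done", without the node ever being updated properly.


asyncio.run(main())
```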
**With the frozenset fix:**

- **T=2**: Polling cycle starts
  - `_polling_jobs = frozenset(['503905'])`
  - Pass to scheduler: `scheduler.get_jobs(jobs=list(_polling_jobs))` ← NEW list (copy!)
  - SLURM mutates the COPY: `['503905', '503905']`
  - `_polling_jobs` UNCHANGED: `frozenset(['503905'])` ← STILL ONE ELEMENT!
- **T=3**: Loop iterates over the frozenset → ONLY ONE ITERATION
  - `future = pop('503905')` → gets F1
  - `F1.set_result(JobInfo(...))` → resolved correctly ✓
  - NO second iteration!
- **T=5-8**: Normal processing, no state corruption ✓
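And the same toy setup with the fix idea applied (again with hypothetical names, mimicking the in-place mutation described above), just to illustrate why handing the scheduler a copy keeps the manager's state intact:

```python
class ToyScheduler:
    def get_jobs(self, jobs):
        # Still mutates its argument, like the in-place mutation described above ...
        jobs.append(jobs[0])
        return {job_id: f'JobInfo({job_id})' for job_id in jobs}


polling_jobs = frozenset({'503905'})  # immutable and deduplicated by construction
scheduler = ToyScheduler()

# ... but it only ever sees a throwaway list built from the frozenset.
jobs_cache = scheduler.get_jobs(jobs=list(polling_jobs))

assert polling_jobs == frozenset({'503905'})  # the manager's state is untouched
for job_id in polling_jobs:                   # exactly one iteration per job id
    print(job_id, jobs_cache.get(job_id))
```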