Add check for stuck jobs in poll()

Open FTang21 opened this issue 2 years ago • 10 comments

Fixes #5352

Description of the Change

Add a way to check whether jobs are stuck by adding an additional poll that runs every hour. It checks whether an active task's fraction_done has not changed and current_cpu_time < 10s.
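As an illustration only (this is not the code in the PR), the check described above could be sketched as follows. The names TaskSnapshot, check_due, and looks_stuck are hypothetical, and the sketch assumes that the 10-second condition refers to CPU time used since the previous check rather than to total CPU time.

```cpp
// Hypothetical stand-in for the per-task values the check needs.
// This is not BOINC's ACTIVE_TASK class.
struct TaskSnapshot {
    double fraction_done;     // progress reported by the app, 0..1
    double current_cpu_time;  // total CPU seconds used so far
};

constexpr double STUCK_CHECK_PERIOD = 3600.0;  // one hour, per the description
constexpr double STUCK_CPU_THRESHOLD = 10.0;   // seconds, per the description

// True when at least one check period has passed since the last check.
bool check_due(double now, double last_check) {
    return now - last_check >= STUCK_CHECK_PERIOD;
}

// True when progress is exactly unchanged since the previous check and
// (assumption) fewer than 10 CPU seconds were used in that interval.
bool looks_stuck(const TaskSnapshot& prev, const TaskSnapshot& cur) {
    bool no_progress = cur.fraction_done == prev.fraction_done;
    bool little_cpu =
        (cur.current_cpu_time - prev.current_cpu_time) < STUCK_CPU_THRESHOLD;
    return no_progress && little_cpu;
}
```

Requiring both conditions is what keeps a long-running task that is still burning CPU, but reports progress rarely, from being flagged.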

Alternate Designs

Release Notes

FTang21 avatar Dec 09 '23 22:12 FTang21

This might be an issue for really big tasks (e.g. ClimatePrediction.net)

AenBleidd avatar Dec 09 '23 22:12 AenBleidd

Would this be more an issue of an hour being too short a time frame, or an issue with the implementation?

FTang21 avatar Dec 09 '23 22:12 FTang21

Codecov Report

Merging #5451 (eb07ea0) into master (c02d6e0) will not change coverage. Report is 16 commits behind head on master. The diff coverage is n/a.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #5451   +/-   ##
=========================================
  Coverage     10.84%   10.84%           
  Complexity     1068     1068           
=========================================
  Files           279      279           
  Lines         36156    36156           
  Branches       8355     8355           
=========================================
  Hits           3920     3920           
  Misses        31842    31842           
  Partials        394      394           

see 1 file with indirect coverage changes

codecov[bot] avatar Dec 09 '23 22:12 codecov[bot]

@FTang21, ah, no, my bad: in the original proposal there was an additional verification method for long-running jobs (CPU time), and I completely missed that this was part of the implementation in this PR.

@davidpanderson, could you please review this and verify that this is a desired implementation of the original proposal?

AenBleidd avatar Dec 09 '23 22:12 AenBleidd

There are various problems with this. I updated the issue to clarify what needs to be done: https://github.com/BOINC/boinc/issues/5352

davidpanderson avatar Dec 10 '23 00:12 davidpanderson

@davidpanderson I updated the implementation based on the updated issue. Let me know if this is OK. Should I make it abort the job on its own after some time, or is this fine for now? Would this also be preferable as its own function?

FTang21 avatar Dec 10 '23 01:12 FTang21

For now, let's just show a message using msg_printf(atp->project, MSG_USER_ALERT...)

... telling the user which job is stuck, and that they should consider aborting it.

This will be useful for testing because we can see the stuck job and decide if it's really stuck.
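For illustration only (not code from the PR), the alert text could be built by a small helper like the one below. The helper name stuck_job_alert and the exact wording are assumptions; in the client, the resulting text would be passed to msg_printf(atp->project, MSG_USER_ALERT, ...) as suggested above.

```cpp
#include <string>

// Hypothetical helper: builds the user-facing alert text for one job.
// The wording is an assumption, not the message used in the PR.
std::string stuck_job_alert(const std::string& job_name) {
    return "Job " + job_name +
           " appears to be stuck; consider aborting it.";
}
```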

davidpanderson avatar Dec 10 '23 02:12 davidpanderson

Gotcha, I updated it to MSG_USER_ALERT

FTang21 avatar Dec 10 '23 02:12 FTang21

Almost but not quite. Please review my pseudo-code.

davidpanderson avatar Dec 11 '23 01:12 davidpanderson

Ah, I see what I missed: it should match the order provided.

FTang21 avatar Dec 11 '23 04:12 FTang21