boinc
stuck jobs
Reportedly, some VM jobs (and possibly others) get into a "stuck" state where they don't make progress: the fraction done doesn't change, and there is little CPU usage. These jobs will eventually be aborted when their elapsed time reaches the rsc_fpops_bound limit, but this could take weeks or months depending on the limit.
Proposal: have the client try to figure out when a job is stuck.
ACTIVE_TASK new fields:
    double stuck_check_elapsed_time
    double stuck_check_fraction_done
    double stuck_check_cpu_time
    (initialize all to zero)

STUCK_CHECK_POLL_PERIOD = 3600

every STUCK_CHECK_POLL_PERIOD seconds
    for each active task atp
        if non_cpu_intensive: continue
        if sporadic: continue
        if atp->stuck_check_elapsed_time == 0
            atp->stuck_check_elapsed_time = atp->elapsed_time
            atp->stuck_check_fraction_done = atp->fraction_done
            atp->stuck_check_cpu_time = atp->current_cpu_time
            continue
        if atp->elapsed_time < atp->stuck_check_elapsed_time + STUCK_CHECK_POLL_PERIOD
            continue
        if atp->stuck_check_fraction_done == atp->fraction_done
            && (atp->current_cpu_time - atp->stuck_check_cpu_time < 10)
            (job is stuck - print warning)
        atp->stuck_check_elapsed_time = atp->elapsed_time
        atp->stuck_check_fraction_done = atp->fraction_done
        atp->stuck_check_cpu_time = atp->current_cpu_time
i.e., a job is considered stuck if, in its last hour of running, the fraction done hasn't changed and the incremental CPU time is < 10s.
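For anyone picking this up, here is a minimal, self-contained C++ sketch of the per-task check described above. It is not the client's code: `ActiveTaskSketch` is a hypothetical stand-in holding only the fields the pseudocode touches, and `take_snapshot`, `stuck_check`, and `STUCK_CHECK_CPU_THRESHOLD` are names invented here for illustration. The real change would add the three fields to ACTIVE_TASK.

```cpp
// Hypothetical stand-in for ACTIVE_TASK; only the fields used by the check.
struct ActiveTaskSketch {
    bool non_cpu_intensive = false;
    bool sporadic = false;
    double elapsed_time = 0;       // seconds the job has been running
    double fraction_done = 0;
    double current_cpu_time = 0;   // seconds of CPU time used so far
    // Proposed new fields, all initialized to zero.
    double stuck_check_elapsed_time = 0;
    double stuck_check_fraction_done = 0;
    double stuck_check_cpu_time = 0;
};

const double STUCK_CHECK_POLL_PERIOD = 3600;   // seconds, from the proposal
const double STUCK_CHECK_CPU_THRESHOLD = 10;   // seconds; name is an assumption

// Record the task's current state as the baseline for the next check.
static void take_snapshot(ActiveTaskSketch& t) {
    t.stuck_check_elapsed_time = t.elapsed_time;
    t.stuck_check_fraction_done = t.fraction_done;
    t.stuck_check_cpu_time = t.current_cpu_time;
}

// Returns true if the task looks stuck: over at least one poll period of
// run time, fraction done is unchanged and CPU time grew by < threshold.
bool stuck_check(ActiveTaskSketch& t) {
    if (t.non_cpu_intensive || t.sporadic) return false;
    if (t.stuck_check_elapsed_time == 0) {
        // First check for this task: just record a baseline.
        take_snapshot(t);
        return false;
    }
    if (t.elapsed_time < t.stuck_check_elapsed_time + STUCK_CHECK_POLL_PERIOD) {
        // Less than a full period of run time since the baseline.
        return false;
    }
    bool stuck =
        t.fraction_done == t.stuck_check_fraction_done
        && (t.current_cpu_time - t.stuck_check_cpu_time
            < STUCK_CHECK_CPU_THRESHOLD);
    take_snapshot(t);
    return stuck;
}
```

The caller would loop over the active tasks, call `stuck_check` on each, and print the warning when it returns true. Returning a bool keeps the detection separate from whatever action (notify or abort) is chosen later.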
At that point, the client could
1) notify the user, suggesting that they abort the job
2) abort the job
Let's do 1) for starters, to make sure that the logic is right, then at some point do 2).
Hello,
I'm Franke Tang, a graduate student currently taking a Distributed Computing course, and part of my final project encourages us to contribute to open issues on GitHub relating to distributed systems. I would like to work on this issue if this has not been implemented yet.
Welcome, @FTang21, sure, go ahead
Hello, sorry for the late follow-up; I was working on PRs in other repos. I was looking through the code. Would app.cpp be a good place to start on this issue?
The new logic would go in ACTIVE_TASK_SET::poll()
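Assuming poll() is called much more often than once per hour, the check would need a gate so it only runs every STUCK_CHECK_POLL_PERIOD seconds. A rough sketch of such a gate follows; `StuckCheckGate` and `due` are hypothetical names, not existing client code, and `now` is assumed to be a wall-clock time in seconds supplied by the caller.

```cpp
// Hypothetical helper: lets a frequently called poll function run an
// expensive or noisy check at most once per `period` seconds.
struct StuckCheckGate {
    double period;          // minimum seconds between checks
    double last_check = 0;  // time of the last check; 0 = never

    explicit StuckCheckGate(double p) : period(p) {}

    // Returns true if at least `period` seconds have passed since the
    // last time this returned true, and records `now` in that case.
    bool due(double now) {
        if (now - last_check < period) return false;
        last_check = now;
        return true;
    }
};
```

Inside the poll function this would look like `if (gate.due(now)) { /* loop over active tasks */ }`, with the gate held as a static or member variable so it persists between calls.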