Task Queue and Records Status Overhaul
Problem: Currently, the task queue and record statuses have the following issues:
- The two status sets do not fully overlap (e.g., WAITING for tasks vs. INCOMPLETE for records).
- Permanently failed tasks create a backlog of ERRORs in the task queue.
- Tasks cannot be put on hold to let other tasks complete first without putting them in an ERRORed state.
Suggestion: We should standardize on the following language:
- WAITING - A task/record that is queued, but has not started yet.
- RUNNING - A task/record that is currently being evaluated on a manager.
- COMPLETE - A task/record that has finished; the task queue will never contain COMPLETE tasks, as tasks are deleted once they finish.
- ERROR - A task/record that is in an errored state, but may be restarted.
- FAILED - A record that was in an errored state but was either collected or marked as a permanent failure and removed from the task queue. Such a record can generate a new task to evaluate itself, but must go through a more specific API than the one used to restart ERROR'd tasks.
- HOLD - A task/record whose computation has been put on hold until its status is moved back to WAITING.
A record can be in any of the above states, while a task can only be in the WAITING/RUNNING/ERROR/HOLD states.
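As a concrete sketch of the shared vocabulary, the statuses could live in a single enum, with tasks restricted to a subset of the states (the names here follow this proposal and are hypothetical, not the current QCFractal API):

```python
from enum import Enum

class RecordStatus(str, Enum):
    WAITING = "waiting"
    RUNNING = "running"
    COMPLETE = "complete"
    ERROR = "error"
    FAILED = "failed"
    HOLD = "hold"

# Tasks only ever exist in a subset of these states: COMPLETE tasks are
# deleted, and FAILED records have no task at all.
TASK_STATUSES = {
    RecordStatus.WAITING,
    RecordStatus.RUNNING,
    RecordStatus.ERROR,
    RecordStatus.HOLD,
}
```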
The distinction between ERROR and FAILED is one worth chatting about. To keep the task queue light and discard true failures over time, a secondary state seems worthwhile, but it does create additional special cases where we will always need to be able to regenerate a task from a record. To be clear on that distinction: a record is the object that provides a record of a computation, while a task is a computable representation of that record (effectively the Python function to be called and its args/kwargs). Currently, tasks are deleted once the record is complete, but tasks that ERROR are always kept.
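To make the record/task distinction concrete, here is a minimal sketch with hypothetical shapes (not the actual QCFractal models), showing how a task could be regenerated from its record:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    id: int
    status: str                 # e.g. "error" or "failed"
    function: str               # the Python function to be called
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

@dataclass
class Task:
    record_id: int
    function: str
    args: tuple
    kwargs: dict

def regenerate_task(record: Record) -> Task:
    """Rebuild the computable representation from the record, e.g. when
    a FAILED record is resubmitted through the more specific API."""
    return Task(record.id, record.function, record.args, record.kwargs)
```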
Questions:
- Are we missing additional use cases and therefore additional states?
- Is the distinction between ERROR and FAILED too small? An additional option is that ERRORed states always delete their task while applying the error to the underlying record. In this case, restarting a record's computation would always regenerate the task information from the record.
- It doesn't seem that Parsl/Dask/Python Futures/Kubernetes have really well-defined task statuses; am I missing something?
Additional Information:
@mturilli @sjayellis @mattwelborn @benclifford @doaa-altarawy @ChayaSt
How about a KILLED status?
What purpose would this add?
It would allow cancellation and future resubmission of jobs, rather than an indefinite HOLD.
So KILLED would replace HOLD?
It would be different from HOLD, allowing for the cancellation of jobs. If I know that I made a mistake in submitting jobs, I might just want those jobs gone. With HOLD, I might pause some queued jobs until a resource is available, or to allow higher-priority jobs to run.
KILLED seems more like an operation that removes both the task and the record. This operation would only be valid when status != COMPLETE, so as not to accidentally delete completed tasks.
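A minimal sketch of that reading of KILLED, assuming hypothetical delete_task/delete_record storage helpers:

```python
def delete_task(record_id): ...    # hypothetical storage helper
def delete_record(record_id): ...  # hypothetical storage helper

def kill(record) -> None:
    """Sketch of KILLED as an operation rather than a status."""
    if record.status == "complete":
        # Guard: only allow killing when status != COMPLETE.
        raise ValueError("refusing to kill a COMPLETE record")
    delete_task(record.id)    # drop the task, if one still exists
    delete_record(record.id)  # drop the record itself
```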
Proposed status transition table. Rows represent starting status. Columns represent ending status. Numbers represent allowed transitions. (Of course, KILLED isn't a real status...)
+----------+---------+---------+----------+-------+--------+------+--------+
| | WAITING | RUNNING | COMPLETE | ERROR | FAILED | HOLD | KILLED |
+----------+---------+---------+----------+-------+--------+------+--------+
| WAITING | | 1 | | | | 2 | 3 |
+----------+---------+---------+----------+-------+--------+------+--------+
| RUNNING | | | 4 | 5 | | | 6 |
+----------+---------+---------+----------+-------+--------+------+--------+
| COMPLETE | | | | | | | |
+----------+---------+---------+----------+-------+--------+------+--------+
| ERROR | 10 | | | | 7 | | |
+----------+---------+---------+----------+-------+--------+------+--------+
| FAILED | | | | | | | |
+----------+---------+---------+----------+-------+--------+------+--------+
| HOLD | 8 | | | | | | 9 |
+----------+---------+---------+----------+-------+--------+------+--------+
| KILLED | | | | | | | |
+----------+---------+---------+----------+-------+--------+------+--------+
1. Job accepted by manager.
2. User sends hold signal.
3. User sends kill signal.
4. Job finished successfully.
5. Job fails.
6. User sends kill signal.
7. Time (1 month?) passes.
8. User sends release signal.
9. User sends kill signal.
10. User sends retry signal.
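For illustration, the table could be encoded as data and enforced with a simple lookup (a sketch, not existing QCFractal code; KILLED is modeled as a pseudo-status here even though it really deletes the task/record):

```python
# Numbers in the comments match the transition notes above.
ALLOWED_TRANSITIONS = {
    "WAITING":  {"RUNNING", "HOLD", "KILLED"},    # 1, 2, 3
    "RUNNING":  {"COMPLETE", "ERROR", "KILLED"},  # 4, 5, 6
    "COMPLETE": set(),                            # terminal
    "ERROR":    {"FAILED", "WAITING"},            # 7, 10
    "FAILED":   set(),  # resubmission goes through a separate API
    "HOLD":     {"WAITING", "KILLED"},            # 8, 9
    "KILLED":   set(),                            # terminal (pseudo-status)
}

def check_transition(current: str, new: str) -> None:
    """Raise if the table does not allow the transition current -> new."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
```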
I think this is starting to bite us @bennybp, @trevorgokey. Do we want to stick with the 0.14.0 milestone on this? Improving the situation on error management could be the big theme for that release.
Lots more statuses (and ability to change statuses) in v0.50!