Task Queue and Records Status Overhaul
Problem: Currently, the task queue and record statuses have the following issues:
- The two status sets do not fully overlap (e.g., WAITING for tasks vs. INCOMPLETE for records).
- Permanently failed tasks create a backlog of ERRORs in the task queue.
- Tasks cannot be put on hold to let other tasks complete first without putting them in an ERRORed state.
Suggestion: We should standardize on the following language:
- WAITING - A task/record that is queued, but has not started yet.
- RUNNING - A task/record that is currently being evaluated on a manager.
- COMPLETE - A task/record that has finished; the task queue will never contain COMPLETE tasks, as tasks are deleted once they finish.
- ERROR - A task/record that is in an errored state, but may be restarted.
- FAILED - A record that was in an errored state but was either collected or marked as a permanent failure and removed from the task queue. Such a record can generate a new task to evaluate itself, but must go through a more specific API than the one used to restart ERROR'd tasks.
- HOLD - A task/record whose computation has been put on hold until its status is moved back to WAITING.
A record can be in any of the above states, while a task can only be in the WAITING/RUNNING/ERROR/HOLD states.
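As a concrete sketch of the shared vocabulary, the statuses could live in a single enum, with tasks restricted to a subset of the states (the names here follow this proposal and are hypothetical, not the current QCFractal API):

```python
from enum import Enum

class RecordStatus(str, Enum):
    WAITING = "waiting"
    RUNNING = "running"
    COMPLETE = "complete"
    ERROR = "error"
    FAILED = "failed"
    HOLD = "hold"

# Tasks only ever exist in a subset of these states: COMPLETE tasks are
# deleted, and FAILED records have no task at all.
TASK_STATUSES = {
    RecordStatus.WAITING,
    RecordStatus.RUNNING,
    RecordStatus.ERROR,
    RecordStatus.HOLD,
}
```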
The distinction between ERROR and FAILED is one worth chatting about. To keep the task queue light and discard true failures over time, a secondary state seems worthwhile, but it does create additional special cases where we will always need to be able to regenerate a task from a record. To be clear on that distinction: a record is the object that provides a record of a computation, while a task is a computable representation of that record (effectively the Python function to be called and its args/kwargs). Currently, tasks are deleted once the record is complete, but tasks that ERROR are always kept.
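To make the record/task distinction concrete, here is a minimal sketch with hypothetical shapes (not the actual QCFractal models), showing how a task could be regenerated from its record:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    id: int
    status: str                 # e.g. "error" or "failed"
    function: str               # the Python function to be called
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

@dataclass
class Task:
    record_id: int
    function: str
    args: tuple
    kwargs: dict

def regenerate_task(record: Record) -> Task:
    """Rebuild the computable representation from the record, e.g. when
    a FAILED record is resubmitted through the more specific API."""
    return Task(record.id, record.function, record.args, record.kwargs)
```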
Questions:
- Are we missing additional use cases and therefore additional states?
- Is the distinction between ERROR and FAILED too small? An additional option is that ERRORed states always delete their task while applying the error to the underlying record. In this case, restarting a record's computation would always regenerate the task information from the record.
- It doesn't seem that Parsl/Dask/Python Futures/Kubernetes have really well-defined task statuses; am I missing something?
Additional Information:
@mturilli @sjayellis @mattwelborn @benclifford @doaa-altarawy @ChayaSt
How about a KILLED status?
What purpose would this add?
It would allow cancellation and future resubmission of jobs, rather than an indefinite HOLD.
So KILLED would replace HOLD?
It would be different from HOLD, allowing for the cancellation of jobs. If I know that I made a mistake in submitting jobs, I might just want those jobs gone. With HOLD, I might pause some queued jobs until a resource is available, or to allow higher-priority jobs to run.
KILLED seems more like an operation that removes both the task and the record. This operation would only be valid when status != COMPLETE, so as not to accidentally delete completed tasks.
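A minimal sketch of that reading of KILLED, assuming hypothetical delete_task/delete_record storage helpers:

```python
def delete_task(record_id): ...    # hypothetical storage helper
def delete_record(record_id): ...  # hypothetical storage helper

def kill(record) -> None:
    """Sketch of KILLED as an operation rather than a status."""
    if record.status == "complete":
        # Guard: only allow killing when status != COMPLETE.
        raise ValueError("refusing to kill a COMPLETE record")
    delete_task(record.id)    # drop the task, if one still exists
    delete_record(record.id)  # drop the record itself
```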
Proposed status transition table. Rows represent starting status. Columns represent ending status. Numbers represent allowed transitions. (Of course, KILLED isn't a real status...)
+----------+---------+---------+----------+-------+--------+------+--------+
| | WAITING | RUNNING | COMPLETE | ERROR | FAILED | HOLD | KILLED |
+----------+---------+---------+----------+-------+--------+------+--------+
| WAITING | | 1 | | | | 2 | 3 |
+----------+---------+---------+----------+-------+--------+------+--------+
| RUNNING | | | 4 | 5 | | | 6 |
+----------+---------+---------+----------+-------+--------+------+--------+
| COMPLETE | | | | | | | |
+----------+---------+---------+----------+-------+--------+------+--------+
| ERROR | 10 | | | | 7 | | |
+----------+---------+---------+----------+-------+--------+------+--------+
| FAILED | | | | | | | |
+----------+---------+---------+----------+-------+--------+------+--------+
| HOLD | 8 | | | | | | 9 |
+----------+---------+---------+----------+-------+--------+------+--------+
| KILLED | | | | | | | |
+----------+---------+---------+----------+-------+--------+------+--------+
1. Job accepted by manager.
2. User sends hold signal.
3. User sends kill signal.
4. Job finished successfully.
5. Job fails.
6. User sends kill signal.
7. Time (1 month?) passes.
8. User sends release signal.
9. User sends kill signal.
10. User sends retry signal.
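For illustration, the table could be encoded as data and enforced with a simple lookup (a sketch, not existing QCFractal code; KILLED is modeled as a pseudo-status here even though it really deletes the task/record):

```python
# Numbers in the comments match the transition notes above.
ALLOWED_TRANSITIONS = {
    "WAITING":  {"RUNNING", "HOLD", "KILLED"},    # 1, 2, 3
    "RUNNING":  {"COMPLETE", "ERROR", "KILLED"},  # 4, 5, 6
    "COMPLETE": set(),                            # terminal
    "ERROR":    {"FAILED", "WAITING"},            # 7, 10
    "FAILED":   set(),  # resubmission goes through a separate API
    "HOLD":     {"WAITING", "KILLED"},            # 8, 9
    "KILLED":   set(),                            # terminal (pseudo-status)
}

def check_transition(current: str, new: str) -> None:
    """Raise if the table does not allow the transition current -> new."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
```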
I think this is starting to bite us @bennybp, @trevorgokey. Do we want to stick with the 0.14.0 milestone on this? Improving the situation on error management could be the big theme for that release.
Lots more statuses (and ability to change statuses) in v0.50!