procrastinate
A canceled task that completes fails with "Job was not found or not in "doing" or "todo" status"
A task that is canceled has its status set to FAILED; however, when the task later completes, only DOING and TODO are accepted for task completion. This causes procrastinate to fail and terminate its process.
https://github.com/procrastinate-org/procrastinate/blob/65829b3fc17374985ee17c2911083f60132f1bfd/procrastinate/shell.py#L142
https://github.com/procrastinate-org/procrastinate/blob/65829b3fc17374985ee17c2911083f60132f1bfd/procrastinate/sql/schema.sql#L207
Dec 30 16:24:00 host1 python3[1280546]: Error:
Dec 30 16:24:00 host1 python3[1280546]: Database error.
Dec 30 16:24:00 host1 python3[1280546]:
Dec 30 16:24:00 host1 python3[1280546]: Job was not found or not in "doing" or "todo" status (job id: 47)
Dec 30 16:24:00 host1 python3[1280546]: CONTEXT: PL/pgSQL function procrastinate_finish_job(integer,procrastinate_job_status,timestamp with time zone,boolean) line 24 at RAISE
Dec 30 16:24:00 host1 systemd[1]: procrastinate.service: Main process exited, code=exited, status=1/FAILURE
Dec 30 16:24:00 host1 systemd[1]: procrastinate.service: Failed with result 'exit-code'.
That's true. Let me rephrase the reproduction steps:
- Defer a task that takes some time to complete
- Have a worker start executing the task
- While the task executes, cancel the task using the shell (procrastinate shell / cancel <id>)
- When the task finishes, it will crash the worker.
I would say that the problem is that the shell should not allow cancelling a task that has already started. It's confusing to call this "cancellation" given that the task will not be interrupted.
If we want to include a "real" cancellation feature at some point, one that is able to interrupt the running task, then we'll need to do that work. Meanwhile, we should just not make it possible to cancel an already running task (it's fine to cancel a task before it starts, though).
While implementing this, we'll likely need to use SELECT FOR UPDATE when checking the task state to avoid race conditions.
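The race that SELECT FOR UPDATE is meant to prevent can be sketched in plain Python. This is a toy in-memory model, not procrastinate's actual code: a lock held across both the status check and the write plays the role of the row lock, so a concurrent cancellation can never slip in between the two.

```python
import threading

# Toy stand-in for the procrastinate_jobs table: job id -> status.
jobs = {47: "doing"}
# Holding this lock across check + write mimics SELECT ... FOR UPDATE.
row_lock = threading.Lock()

def finish_job(job_id, end_status):
    # Check and transition atomically; a cancel cannot interleave here.
    with row_lock:
        if jobs.get(job_id) not in ("todo", "doing"):
            raise RuntimeError(f"job {job_id} not in 'todo' or 'doing' status")
        jobs[job_id] = end_status

def cancel_job(job_id):
    # Only a job that has not started yet may be cancelled.
    with row_lock:
        if jobs.get(job_id) == "todo":
            jobs[job_id] = "failed"  # cancellation is stored as 'failed' today

finish_job(47, "succeeded")
print(jobs[47])  # prints "succeeded"
```

Without the lock, a cancel landing between the check and the write would be silently overwritten by the final status; with it, the two operations behave like a single transaction.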
I was trying to implement task cancellation using the same principles as the shell when I ran into this bug.
It would be nice if JobContext.job status was updated when a job's status changes. Right now I'm settling for periodically checking the procrastinate_jobs table.
When you say we need to use SELECT FOR UPDATE do you mean only in the procrastinate_finish_job SQL function or do we need to do locking also when checking the current status of a task?
I'm willing to spend some time on this (mostly making the status of a JobContext.job update when a new status is set) if there is any interest in such functionality?
> When you say we need to use SELECT FOR UPDATE do you mean only in the procrastinate_finish_job SQL function or do we need to do locking also when checking the current status of a task?
Ah, I just meant when checking the state of the job before calling procrastinate_finish_job, to ensure we don't finish a cancelled task.
> It would be nice if JobContext.job status was updated when a job's status changes. Right now I'm settling for periodically checking the procrastinate_jobs table.
It could but it would be a change of paradigm. Currently, this is just a Python attribute lookup, and I'd like to avoid having hidden I/O (simple attribute lookups that trigger database operations). It looks innocent at first, up until the point you realize it creates all kinds of funky links in your codebase, and simple things start becoming hard.
That being said, I'm absolutely not opposed to exposing functions that would give this information in a nice way. In fact, I think there's already something you can use: procrastinate.manager.JobManager.list_jobs_async(id=<your job id>). I believe this will give you access to the job's state.
procrastinate.manager.JobManager.list_jobs(id=<your job id>) should also work, shouldn't it? However, I can't seem to get it to work with synchronous tasks, as listing the jobs raises a RuntimeError('This event loop is already running').
Ah, yeah, the problem is that the loop is not re-entrant...
The relevant code is: https://github.com/procrastinate-org/procrastinate/blob/master/procrastinate/utils.py#L190-L199
Considering we're in a task, so in async code, calling some sync code, which then wants to call async code... Yeah, it sucks :( I don't know if the answer should be threading (:skull: :fearful:) or some kinds of shenanigans or what...
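One of the "shenanigans" that can work around the non-reentrant loop is to run the coroutine on a separate thread that owns its own event loop. A minimal sketch, assuming nothing about procrastinate's internals (get_job_status is a hypothetical stand-in for an async DB lookup such as list_jobs_async):

```python
import asyncio
import concurrent.futures

async def get_job_status(job_id):
    # Hypothetical stand-in for an async database lookup.
    await asyncio.sleep(0)
    return "doing"

def sync_task_body(job_id):
    # This sync code may itself be running inside an already-running event
    # loop, where asyncio.run() on the current thread would raise
    # "This event loop is already running". A dedicated thread gets its own
    # fresh loop, so asyncio.run() is safe there.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, get_job_status(job_id)).result()

async def main():
    # Simulates the worker's loop executing a sync task in an executor.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, sync_task_body, 47)

print(asyncio.run(main()))  # prints "doing"
```

The cost is one extra thread per lookup (or a long-lived helper thread), which is exactly the trade-off alluded to above.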
Had time to return to this issue. I was thinking about something like this:
CREATE FUNCTION procrastinate_finish_job(job_id integer, end_status procrastinate_job_status, next_scheduled_at timestamp with time zone, delete_job boolean) RETURNS void
    LANGUAGE plpgsql
    AS $$
DECLARE
    _job_id bigint;
BEGIN
    IF end_status NOT IN ('succeeded', 'failed') THEN
        RAISE 'End status should be either "succeeded" or "failed" (job id: %)', job_id;
    END IF;
    IF delete_job THEN
        DELETE FROM procrastinate_jobs
        WHERE id = job_id AND status IN ('todo', 'doing', 'failed')
        RETURNING id INTO _job_id;
    ELSE
        -- Lock the row so a concurrent cancellation cannot slip in between
        -- the check and the update.
        SELECT id INTO _job_id FROM procrastinate_jobs
        WHERE id = job_id AND status IN ('todo', 'doing', 'failed')
        FOR UPDATE;
        -- Only transition jobs that are still 'todo' or 'doing'; a job
        -- already 'failed' (cancelled) is left untouched rather than
        -- raising. Note: no RETURNING here, so _job_id from the SELECT
        -- survives when the UPDATE matches zero rows.
        UPDATE procrastinate_jobs
        SET status = end_status,
            attempts = CASE
                WHEN status = 'doing' THEN attempts + 1
                ELSE attempts
            END
        WHERE id = job_id AND status IN ('todo', 'doing');
    END IF;
    IF _job_id IS NULL THEN
        RAISE 'Job was not found or not in "doing", "todo" or "failed" status (job id: %)', job_id;
    END IF;
END;
$$;
This would be the easiest option (the other being adding a "CANCELED" status, which would require a bit more code), and then we could work on the things outlined in https://github.com/procrastinate-org/procrastinate/issues/780#issuecomment-1529005989 to make it more feature-complete, because right now each job would have to resort to polling the database at some desired rate, as there is no signaling to indicate a job was canceled.
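The polling fallback mentioned above could look something like this sketch. It is a toy model: a dict stands in for the procrastinate_jobs table, and a timer thread plays the role of someone cancelling the job from the shell; fetch_status is a hypothetical helper, not a procrastinate API.

```python
import threading
import time

# Toy stand-in for the procrastinate_jobs table: job id -> status.
table = {47: "doing"}

def fetch_status(job_id):
    # Hypothetical helper; in real code this would query procrastinate_jobs.
    return table[job_id]

def work_until_cancelled(job_id, poll_interval=0.01, max_steps=1000):
    """Do work in small units, re-checking the job's status between units."""
    steps = 0
    while fetch_status(job_id) == "doing" and steps < max_steps:
        steps += 1  # one unit of actual work would go here
        time.sleep(poll_interval)
    return steps

# Another thread "cancels" the job by flipping its status row.
canceller = threading.Timer(0.05, table.__setitem__, args=(47, "failed"))
canceller.start()
steps = work_until_cancelled(47)
canceller.join()
print("stopped after", steps, "work units")
```

The poll interval is the trade-off: shorter means faster reaction to cancellation but more database load, which is exactly why a signaling mechanism would be nicer.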
I think it would make sense to add CANCELED and maybe also ABORTED as possible states (though ABORTED is not possible as of now; once it is, it will help to know whether part of the job ran vs none at all). I'm afraid doubling down on calling canceled tasks "failed" will only lead to more ambiguity down the line. That was a bad choice I made, it's probably time to revisit it :)
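A sketch of what the richer status set could look like. The names and the distinction drawn in the comments are the ones proposed above; this is not procrastinate's actual enum:

```python
import enum

class JobStatus(enum.Enum):
    # Statuses procrastinate has today:
    TODO = "todo"            # not started yet
    DOING = "doing"          # currently running
    SUCCEEDED = "succeeded"  # finished successfully
    FAILED = "failed"        # finished with an error
    # Hypothetical additions discussed above:
    CANCELLED = "cancelled"  # cancelled before starting; nothing ran
    ABORTED = "aborted"      # interrupted mid-run; part of the job may have run

# With distinct terminal states, "did any of it run?" becomes answerable:
RAN_PARTIALLY_OR_FULLY = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.ABORTED}
```

Keeping CANCELLED separate from FAILED removes the ambiguity that triggers this bug in the first place: procrastinate_finish_job could then reject transitions out of CANCELLED explicitly instead of confusing cancellation with failure.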