procrastinate icon indicating copy to clipboard operation
procrastinate copied to clipboard

A canceled task that completes fails with "Job was not found or not in "doing" or "todo" status"

Open wokis opened this issue 3 years ago • 7 comments

A task that is canceled has it status set to FAILED, however when the task completes only DOING and TODO is accepted for task completion. This causes procrastinate to fail and terminate it's process.

https://github.com/procrastinate-org/procrastinate/blob/65829b3fc17374985ee17c2911083f60132f1bfd/procrastinate/shell.py#L142 https://github.com/procrastinate-org/procrastinate/blob/65829b3fc17374985ee17c2911083f60132f1bfd/procrastinate/sql/schema.sql#L207

Dec 30 16:24:00 host1 python3[1280546]: Error:
Dec 30 16:24:00 host1 python3[1280546]:     Database error.
Dec 30 16:24:00 host1 python3[1280546]:     
Dec 30 16:24:00 host1 python3[1280546]: Job was not found or not in "doing" or "todo" status (job id: 47)
Dec 30 16:24:00 host1 python3[1280546]: CONTEXT:  PL/pgSQL function procrastinate_finish_job(integer,procrastinate_job_status,timestamp with time zone,boolean) line 24 at RAISE
Dec 30 16:24:00 host1 systemd[1]: procrastinate.service: Main process exited, code=exited, status=1/FAILURE
Dec 30 16:24:00 host1 systemd[1]: procrastinate.service: Failed with result 'exit-code'.

wokis avatar Dec 30 '21 16:12 wokis

That's true. Let me rephrase reproduction steps:

  • Defer a task that takes some time to complete
  • Have a worker start executing the task
  • While the task executes, cancel the task using the shell (procrastinate shell / cancel <id>)
  • When the task finishes, it will crash the worker.

I would say that the problem is that the shell should not accept to cancel a task that has started. It's confusing to call this "cancellation" given that the task will not be interrupted.

If we want to include a "real" cancellation feature at some point, that is able to interrupt the running task, then we'll need to do that. Meanwhile we should just not make it possible to cancel an already running task. (it's fine to cancel a task before it starts though)

While implementing this, we'll likely need to use SELECT FOR UPDATE when checking the task state to avoid race conditions.

ewjoachim avatar Dec 30 '21 20:12 ewjoachim

I was trying to implement task cancelation using the same principles as the shell when I ran into this bug.

It would be nice if JobContext.job status was updated when a job's status changes. Right now I'm settling for periodically checking the procrastinate_jobs table.

When you say we need to use SELECT FOR UPDATE do you mean only in the procrastinate_finish_job SQL function or do we need to do locking also when checking the current status of a task?

I'm willing to spend some time on this (mostly making the status of a JobContext.job update when a new status is set) if there is any interest in such functionality?

wokis avatar Jan 01 '22 21:01 wokis

When you say we need to use SELECT FOR UPDATE do you mean only in the procrastinate_finish_job SQL function or do we need to do locking also when checking the current status of a task?

Ah I just meant when checking the state of the state before calling procrastinate_finish_job to ensure we don't finish a cancelled task.

It would be nice if JobContext.job status was updated when a job's status changes. Right now I'm settling for periodically checking the procrastinate_jobs table.

It could but it would be a change of paradigm. Currently, this is just a Python attribute lookup, and I'd like to avoid having hidden I/O (simple attribute lookups that trigger database operations). It looks innocent at first, up until the point you realize it creates all kinds of funky links in your codebase, and simple things start becoming hard.

That being said, I'm absolutely not opposed to exposing functions that would give this information in a nice way. That being said, I think there's already something you can use in procrastinate.manager.JobManager.list_jobs_async(id=<your job id>). I believe this will give you access to the job's state.

ewjoachim avatar Jan 01 '22 21:01 ewjoachim

procrastinate.manager.JobManager.list_jobs(id=<your job id>) should also work, shouldn't it? However I can't seem to get it to work with synchronous tasks as listing the jobs raises a RuntimeError('This event loop is already running').

wokis avatar Jan 04 '22 12:01 wokis

Ah, yeah, the problem is that the loop is not re-entrant...

The relevant code is: https://github.com/procrastinate-org/procrastinate/blob/master/procrastinate/utils.py#L190-L199

Considering we're in a task, so in async code, calling some sync code, which then wants to call async code... Yeah, it sucks :( I don't know if the answer should be threading (:skull: :fearful:) or some kinds of shenanigans or what...

ewjoachim avatar Jan 04 '22 13:01 ewjoachim

Had time to return to this issue. I was thinking about someting like this:

CREATE FUNCTION procrastinate_finish_job(job_id integer, end_status procrastinate_job_status, next_scheduled_at timestamp with time zone, delete_job boolean) RETURNS void
    LANGUAGE plpgsql
AS $$
DECLARE
    _job_id bigint;
BEGIN
    IF end_status NOT IN ('succeeded', 'failed') THEN
        RAISE 'End status should be either "succeeded" or "failed" (job id: %)', job_id;
    END IF;
    IF delete_job THEN
        DELETE FROM procrastinate_jobs
        WHERE id = job_id AND status IN ('todo', 'doing', 'failed')
        RETURNING id INTO _job_id;
    ELSE
        SELECT id INTO _job_id FROM procrastinate_jobs
        WHERE id = job_id AND status IN ('todo', 'doing', 'failed')
        FOR UPDATE;

        UPDATE procrastinate_jobs
        SET status = end_status,
            attempts =
                CASE
                    WHEN status = 'doing' THEN attempts + 1
                    ELSE attempts
                END
        WHERE id = job_id AND status IN ('todo', 'doing')
        RETURNING id INTO _job_id;
    END IF;
    IF _job_id IS NULL THEN
        RAISE 'Job was not found or not in "doing", "todo" or "failed" status (job id: %)', job_id;
    END IF;
END;
$$;

This would be the easiest option (the other being adding a "CANCELED" status but would require a bit more code) and then we could work on things outlined in https://github.com/procrastinate-org/procrastinate/issues/780#issuecomment-1529005989 to make it more feature complete, cause right now each job would have to resort in pulling the database at some desired rate as there is no signaling to indicate a job was canceled.

wokis avatar Dec 18 '23 10:12 wokis

I think it would make sense to add CANCELED and maybe also ABORTED as possible states (though ABORTED is not possible as of now, but once it is, it will help knowing whether part of the job ran vs none at all). I'm afraid doubling down calling canceled tasks "failed" will only lead to more ambiguity down the line. That was a bad choice I made, it's probably time to revisit it :)

ewjoachim avatar Dec 21 '23 14:12 ewjoachim