procrastinate icon indicating copy to clipboard operation
procrastinate copied to clipboard

Retry Failed Job

Open ticosax opened this issue 9 months ago • 5 comments
trafficstars

This is an action intended to be manually invoked that put the failed job's state back to todo, while keeping the events history.

Only integrated within django (for now) It's an Early submission (draft) for the purpose of getting feedback the maintainers. I would like to get confirmation that's something they'd like to consider, assuming I'll push it through with tests and updated documentation. only Tested manually on my machine as it is.

Closes N/A

Successful PR Checklist:

  • [x] Tests
    • [ ] (not applicable?)
  • [x] Documentation
    • [ ] (not applicable?)

PR label(s):

  • [ ] https://github.com/procrastinate-org/procrastinate/labels/PR%20type%3A%20breaking%20%F0%9F%92%A5
  • [x] https://github.com/procrastinate-org/procrastinate/labels/PR%20type%3A%20feature%20%E2%AD%90%EF%B8%8F
  • [ ] https://github.com/procrastinate-org/procrastinate/labels/PR%20type%3A%20bugfix%20%F0%9F%95%B5%EF%B8%8F
  • [ ] https://github.com/procrastinate-org/procrastinate/labels/PR%20type%3A%20miscellaneous%20%F0%9F%91%BE
  • [ ] https://github.com/procrastinate-org/procrastinate/labels/PR%20type%3A%20dependencies%20%F0%9F%A4%96
  • [x] https://github.com/procrastinate-org/procrastinate/labels/PR%20type%3A%20documentation%20%F0%9F%93%9A

ticosax avatar Feb 12 '25 18:02 ticosax

Hello!

I like the idea and large parts of the work already done. I'm not sure I think a retried status is the best though. We had issues when we introduced a aborting state, which we had to undo. I think a job can be both retried AND todo, or doing, or suceeded or failed. If we want to capture that, it would rather be, I think, an event in the events table (capturing the fact a same job might be retried multiple times).

I like the idea of Django admin actions. but this immediately bears the idea of also aborting a task through the admin, though that might be for a later PR.

ewjoachim avatar Feb 17 '25 21:02 ewjoachim

Hi, thanks for your feedback. I'm not sure I completely understand your recommendations though :)

I'm not sure I think a retried status is the best though.

I agree I haven't introduced one.

If we want to capture that, it would rather be, I think, an event in the events table

Exactly, we are in agreement. and the PR does that already ...

Am I hallucinating or you reviewed it a bit too quickly :)

Nevertheless, I will add some tests it will make feature clearer on how it's supposed to work, and maybe it will reveal my hallucinations. Stay tuned

ticosax avatar Mar 03 '25 13:03 ticosax

In addition to django admin, do you see a need to expose this feature on the cli ? Where else could it be useful ?

ticosax avatar Mar 03 '25 15:03 ticosax

Hm, maybe I have, I'll take some time time and reread through, apologies.

ewjoachim avatar Mar 04 '25 07:03 ewjoachim

Coverage report

This PR does not seem to contain any modification to coverable code.

github-actions[bot] avatar Apr 02 '25 08:04 github-actions[bot]

First: sorry it took me so long to dive into this.

No need, I assume we are all volunteers here.

This seems, nice, I'm wondering if you can share a bit on what led you to create a new function procrastinate_retry_failed_job_v1 rather than adapting procrastinate_retry_job_v1. I'm not saying it's not a good choice, it might be perfectly fine, but I feel I'm missing a bit of the reasons (might be "to avoid messing with existing code paths" ?)

Because the signature is different, This version don't accept retry_at (as explained in my other comment). No other reasons I can remember of. I'll gladly change the code and follow your guidance if you prefer an alternative.

All in all, this looks great :) I'd love an opinion from @medihack

thanks :)

ticosax avatar Jul 07 '25 08:07 ticosax

This seems, nice, I'm wondering if you can share a bit on what led you to create a new function procrastinate_retry_failed_job_v1 rather than adapting procrastinate_retry_job_v1. I'm not saying it's not a good choice, it might be perfectly fine, but I feel I'm missing a bit of the reasons (might be "to avoid messing with existing code paths" ?)

I also find it a little unusual to have two retry functions now, but I can certainly adapt to it and understand the reasoning behind it. Just a counterproposal... how about something like this:

CREATE FUNCTION procrastinate_retry_job_v2(
    job_id bigint,
    retry_at timestamp with time zone,
    new_priority integer,
    new_queue_name character varying,
    new_lock character varying
) RETURNS void LANGUAGE plpgsql AS $$
DECLARE
    _job_id bigint;
    _abort_requested boolean;
    _current_status procrastinate_job_status;
BEGIN
    SELECT status, abort_requested FROM procrastinate_jobs
    WHERE id = job_id AND status IN ('doing', 'failed')
    FOR UPDATE
    INTO _current_status, _abort_requested;
    IF _current_status = 'doing' AND _abort_requested THEN
        UPDATE procrastinate_jobs
        SET status = 'failed'::procrastinate_job_status
        WHERE id = job_id AND status = 'doing'
        RETURNING id INTO _job_id;
    ELSE
        UPDATE procrastinate_jobs
        SET status = 'todo'::procrastinate_job_status,
            attempts = attempts + 1,
            scheduled_at = retry_at,
            priority = COALESCE(new_priority, priority),
            queue_name = COALESCE(new_queue_name, queue_name),
            lock = COALESCE(new_lock, lock)
        WHERE id = job_id AND status IN ('doing', 'failed')
        RETURNING id INTO _job_id;
    END IF;

    IF _job_id IS NULL THEN
        RAISE 'Job was not found or has an invalid status to retry (job id: %)', job_id;
    END IF;
END;
$$;

The retry_at could be set to the present time or we can make all parameters fully optional by using DEFAULT NULL and adding scheduled_at = COALESCE(retry_at, scheduled_at).

Additionally, as already suggested, it would be beneficial to have a way to execute it using the CLI (although we can also add this feature using a second PR later).

That all said, it's absolutely a nice feature 🙂.

medihack avatar Jul 07 '25 23:07 medihack

Thanks, I just pushed changes reflecting your suggestion. Now we are using the same retry procedure for TODO and FAILED jobs. Django integration, defaults on retry_at=now(), for sake of simplicity.

ticosax avatar Jul 16 '25 08:07 ticosax

I don't understand the failure related to versioning of migrations. I tried multiple version schemes. none of them worked.

ticosax avatar Jul 16 '25 10:07 ticosax

I don't understand the failure related to versioning of migrations. I tried multiple version schemes. none of them worked.

This system is quite new so it's not impossible that I forgot a couple of things in it. I'll take it from there :)

ewjoachim avatar Jul 16 '25 12:07 ewjoachim

When I can't rebase my mistakes away (because mainatainers can't rebase on a fork PR), it's much harder to do 🦄✨magic🪄🎩

ewjoachim avatar Jul 16 '25 23:07 ewjoachim

Thank you for your patience !

ewjoachim avatar Jul 18 '25 09:07 ewjoachim