
pgBoss.stop doesn't remove active jobs

dolegi opened this issue 2 years ago · 5 comments

Hey, first off thanks so much for pg-boss, it's an extremely useful library!

When calling pgBoss.stop() and waiting for the stopped event, jobs that take longer than the timeout get stuck in the active state.

What currently happens

We have some singleton jobs that run for between ~10 minutes and just over 1 hour, so we have set them to expire only after 120 minutes. When we re-deploy our job workers, the active jobs stay in pg-boss until they expire, so a job doesn't get re-triggered until the active row (which no worker is actually processing anymore) expires.

Request

Ideally, when re-deploying, we could catch the SIGTERM, call pgboss.stop({timeout: x}), and have that stop the worker and remove any active jobs.
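The wiring described above might look like the following sketch. The `registerShutdown` helper is hypothetical; the `graceful` and `timeout` options to `stop()` are real pg-boss options, but check your version's docs for exact defaults.

```javascript
// Hypothetical helper: wire SIGTERM to a graceful pg-boss shutdown.
// `boss` is an already-started pg-boss instance.
function registerShutdown(boss, { timeout = 30000 } = {}) {
  let stopping = false;
  const shutdown = async () => {
    if (stopping) return; // ignore repeated signals
    stopping = true;
    // Stop polling for new jobs and wait up to `timeout` ms
    // for active jobs to finish before resolving.
    await boss.stop({ graceful: true, timeout });
  };
  process.once('SIGTERM', shutdown);
  return shutdown; // returned so it can also be invoked manually / in tests
}
```

Note that, as this issue describes, `stop()` alone does not fail or delete jobs that are still active when the timeout elapses; that is the behavior being requested.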

TL;DR Request

Have pgBoss.stop() delete/update active jobs when the worker stops.

Or should we be manually deleting active jobs by tracking job ids and updating the pgboss.job table directly? Is there a recommended way to approach this?

Related issues

https://github.com/timgit/pg-boss/issues/268

Thanks!

dolegi avatar Feb 04 '22 14:02 dolegi

Hey, thanks! I agree with your suggestion, which is pretty similar to the expiration promise that is started along with jobs in the worker. I will look into an ideal way of opting into this.

Also, have you considered listening for SIGTERM in your worker callback function to fail the job yourself?
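That suggestion could be sketched like this: check a shutdown flag inside the worker callback and throw, so pg-boss marks the job failed (and therefore retryable) instead of leaving it active. The handler and its step list are hypothetical; only the throw-to-fail behavior comes from pg-boss.

```javascript
// Flip a flag on SIGTERM; long-running handlers poll it between steps.
let shuttingDown = false;
process.once('SIGTERM', () => { shuttingDown = true; });

// Hypothetical long-running handler: `runSteps` is an array of async fns.
async function longJobHandler(job, runSteps) {
  for (const step of runSteps) {
    if (shuttingDown) {
      // Throwing fails the job in pg-boss rather than leaving it active
      throw new Error(`job ${job.id} aborted: worker shutting down`);
    }
    await step();
  }
}
// Registration would look roughly like:
//   await boss.work('long-job', job => longJobHandler(job, steps));
```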

timgit avatar Feb 05 '22 00:02 timgit

Hi tim, thanks for looking into it. We are considering updating the job statuses directly, but it feels wrong and contrary to the intended way of working with pg-boss.

UPDATE pgboss.job SET state = '<abandoned>' WHERE state = 'active' AND id IN (<ids from this instance's workers>);

We have to be careful to update only the job ids from the current instance, since workers on other instances could still be actively processing jobs.
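The id-tracking described above could be sketched as follows. This is a hypothetical approach, not a pg-boss API: the instance records the ids its workers pick up, then builds one parameterized node-postgres query scoped to those ids. It uses `'failed'` (a real pg-boss state) rather than the `<abandoned>` placeholder, so the jobs become eligible for retry.

```javascript
// Hypothetical: each worker callback adds its job id here on start
// and removes it on completion, so the set holds this instance's
// currently-active jobs.
const activeJobIds = new Set();

// Build a parameterized UPDATE (node-postgres query config shape)
// that fails only this instance's active jobs.
function failActiveJobsQuery(ids) {
  return {
    text: "UPDATE pgboss.job SET state = 'failed' " +
          "WHERE state = 'active' AND id = ANY($1)",
    values: [Array.from(ids)],
  };
}

// On shutdown, something like:
//   await pgClient.query(failActiveJobsQuery(activeJobIds));
```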

dolegi avatar Feb 07 '22 13:02 dolegi

Hi @timgit any updates on this?

StarpTech avatar Aug 27 '23 10:08 StarpTech

No work is being planned for this request right now. First of all, there is a reason SQS doesn't let you hold on to a message for hours. But long-running promises aside, I think the best approach would be to fail the jobs after the timeout. They would then be eligible for retry by another worker.
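Under that approach, retry eligibility is governed by the job's send options. A hedged sketch (the option names `retryLimit`, `retryDelay`, and `expireInMinutes` are pg-boss send options; the values are illustrative, with the expiration matching the ~2-hour window mentioned earlier in this thread):

```javascript
// Illustrative send options: if the job fails (or its worker dies and
// it expires), pg-boss can retry it on another worker.
const sendOptions = {
  retryLimit: 3,        // retry up to 3 times after failure/expiration
  retryDelay: 60,       // wait 60 seconds between retries
  expireInMinutes: 120, // matches the ~2h expiration described above
};
// Usage: await boss.send('long-job', payload, sendOptions);
```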

timgit avatar Aug 30 '23 22:08 timgit

I'll consider adding this in v10.

timgit avatar Aug 31 '23 14:08 timgit