
Prioritize retry state over created for same singleton key in stately fetch

foteiniskiada opened this issue 11 months ago • 6 comments

Issue: https://github.com/timgit/pg-boss/issues/535

foteiniskiada · Jan 15 '25

@timgit could you please take a look? We are experiencing issues in our production application.

klesgidis · Jan 22 '25

Thanks for pointing out this limitation with stately queues. You're correct in your assessment regarding how to extend a stately queue with a singletonKey. After reviewing the PR, I think it adds too much complexity to enforce across all queue types, and would negatively impact performance in a larger queue. I also don't think it will correctly resolve all failure use cases.

The failure case you're seeing happens because once the next job to be fetched would produce a unique constraint violation (with or without a singletonKey), all job processing is blocked.

One example of how this fails is:

  1. Create job A with singletonKey=123. It's in created state.
  2. Create job B with singletonKey=123. Job B is rejected correctly because of the unique constraint. This is the happy path use case.
  3. Fetch, putting job A in active state.
  4. Create job C with singletonKey=123. It's in created state.
  5. Create job D with singletonKey=456. It's in created state.
  6. Fail job A, putting it in retry state.
  7. Now, attempt to fetch 2 jobs. Since the default sort is by creation date, the next 2 jobs have the same singletonKey. This attempts to set both jobs to active state, but only 1 is allowed.
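
As a rough sketch in code, assuming the v10 API shape (the queue name, connection string, and retry settings are placeholders, and exact `fetch`/`fail` signatures may vary between releases):

```ts
import PgBoss from 'pg-boss';

async function reproduce() {
  const boss = new PgBoss('postgres://user:pass@localhost/app');
  await boss.start();

  // stately queue: the policy that enforces the singletonKey unique constraint
  await boss.createQueue('work', { policy: 'stately' });

  // 1. Job A, singletonKey=123, lands in created state
  await boss.send('work', { step: 'A' }, { singletonKey: '123', retryLimit: 2 });

  // 2. Job B, same key, is rejected by the unique constraint (the happy path)
  await boss.send('work', { step: 'B' }, { singletonKey: '123' });

  // 3. Fetch one job, moving job A to active
  const [jobA] = (await boss.fetch('work')) ?? [];
  if (!jobA) throw new Error('expected job A');

  // 4. Job C, singletonKey=123, is accepted while A is active
  await boss.send('work', { step: 'C' }, { singletonKey: '123' });

  // 5. Job D, singletonKey=456
  await boss.send('work', { step: 'D' }, { singletonKey: '456' });

  // 6. Fail job A, putting it in retry state
  await boss.fail('work', jobA.id);

  // 7. A batch fetch of 2 now selects A (retry) and C (created) by creation date.
  //    Both carry singletonKey=123, so activating them together violates the
  //    unique constraint and the queue stops making progress.
  const jobs = await boss.fetch('work', { batchSize: 2 });
  console.log(jobs);
}

reproduce().catch(console.error);
```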

The only way I see to avoid this use case is to not use batching with stately queues. The batch processing SQL statement needs to be enhanced to allow dropping one of the previously accepted jobs. This feels like a gray area since the job was previously accepted, but stately queues are already the type of policy that is accustomed to dropping jobs. In its current state, batching with a unique constraint violation will block all processing until the batch size is reduced back to 1.
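
On the worker side, that would look roughly like the following; a minimal sketch assuming v10's `work()` options, with queue names and handler bodies as placeholders:

```ts
import PgBoss from 'pg-boss';

async function startWorkers() {
  const boss = new PgBoss('postgres://user:pass@localhost/app');
  await boss.start();

  // Stately queue: batchSize 1, so a single singletonKey conflict can't stall a batch
  await boss.work('stately-queue', { batchSize: 1 }, async ([job]) => {
    console.log('processing', job.id);
  });

  // Other queues can keep a larger batch size
  await boss.work('standard-queue', { batchSize: 50 }, async (jobs) => {
    for (const job of jobs) {
      console.log('processing', job.id);
    }
  });
}

startWorkers().catch(console.error);
```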

Another side effect of this behavior is more closely related to this PR, which adds a sort condition for the singletonKey. However, this would still produce a processing limitation once a conflict is experienced on a particular key. Once a unique constraint is triggered for any job, no other jobs can be processed. This more closely aligns with the original intent of these queue policies, which is to reduce concurrency as much as possible.

timgit · Jan 26 '25

Thank you for your detailed response and for the great work you are doing with pgboss. We truly appreciate your efforts in maintaining and improving this library.

> The only way I see to avoid this use case is to not use batching with stately queues. The batch processing SQL statement needs to be enhanced to allow dropping one of the previously accepted jobs. This feels like a gray area since the job was previously accepted, but stately queues are already the type of policy that is accustomed to dropping jobs. In its current state, batching with a unique constraint violation will block all processing until the batch size is reduced back to 1.

We understand your perspective; however, using batchSize=1 is not a viable solution for us due to the scale of our production application. To provide some context, our system processes hundreds of jobs per second. Reducing the batch size to 1 would result in significant delays and impact overall performance. Therefore, we need to find a solution that allows us to continue utilizing batch processing while addressing the unique constraint issue introduced in version 10. This is why we proposed the PR.

> After reviewing the PR, I think it adds too much complexity to enforce across all queue types, and would negatively impact performance in a larger queue. I also don't think it will correctly resolve all failure use cases.

Could you provide more details or results from performance tests that highlight this impact? In our view, a queue processing one job per query (batch size 1) represents a far greater performance concern for the entire system. We would be interested in understanding how the proposed changes specifically add complexity or impact performance in large queues.

> Another side effect of this behavior is more closely related to this PR, which adds a sort condition for the singletonKey. However, this would still produce a processing limitation once a conflict is experienced on a particular key. Once a unique constraint is triggered for any job, no other jobs can be processed. This more closely aligns with the original intent of these queue policies, which is to reduce concurrency as much as possible.

Could you clarify how a conflict would occur, given that the implementation uses DISTINCT on the singletonKey? From our understanding, this should prevent such conflicts from arising.

We have been attempting to upgrade to v10 for nearly two months now but are facing significant performance issues without batch processing. We sincerely hope to find a resolution through collaboration, as we value the capabilities of pgboss. However, if we cannot maintain the necessary system performance, we may need to explore alternative solutions.

Once again, thank you for your hard work and for taking the time to consider our input. We look forward to hearing your thoughts on this matter.

klesgidis · Jan 28 '25

@timgit any update on this?

klesgidis · Feb 19 '25

This is a challenging one that I'm still thinking through. You may want to avoid using singleton and stately queues with singleton keys for now.

timgit · Feb 25 '25

> This is a challenging one that I'm still thinking through.

Thank you! Let us know if there is anything we can do to help

> You may want to avoid using singleton and stately queues with singleton keys for now.

Unfortunately we can't since it is essential for our application.

klesgidis · Feb 25 '25

@timgit I think the cleanest solution is to add a new queue type, exactly_once, where only 1 job can be in (created, retry, active).
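
A purely hypothetical sketch of what that could look like (exactly_once is not an existing pg-boss policy; the name and behavior here only illustrate the proposal):

```ts
import PgBoss from 'pg-boss';

const boss = new PgBoss('postgres://user:pass@localhost/app');
await boss.start();

// Hypothetical: 'exactly_once' does NOT exist in pg-boss today. The idea is
// at most one job per singletonKey across created, retry, and active states.
await boss.createQueue('migrations', { policy: 'exactly_once' });

// Accepted: nothing with this key is queued, retrying, or running yet.
await boss.send('migrations', { db: 'tenant-1' }, { singletonKey: 'tenant-1' });

// Rejected until the first job completes, fails permanently, or is cancelled.
await boss.send('migrations', { db: 'tenant-1' }, { singletonKey: 'tenant-1' });
```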

fenos · Jul 11 '25

This would have the side effect of losing "queue semantics" in my opinion. If a job is active and you want to queue another one once it's done, users wouldn't enjoy "polling" the queue via send() until it finally accepts it.

timgit · Jul 11 '25

Hi @timgit, kind of. I have a use case where I run some SQL migrations on different databases via a queue handler. I want to be sure only 1 job is running at a time, and that no other job gets queued while one is already running, naturally sharding by singleton_key.

We could solve the above problem with this new mode and its tradeoffs. This way we wouldn't have 2 jobs trying to get into the active state, and retries would work as expected.
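
For reference, the sharding part looks roughly like this today; a sketch in which the queue name, connection string, tenant list, and the `runMigrations` helper are all placeholders:

```ts
import PgBoss from 'pg-boss';

// Hypothetical helper: applies pending migrations to one tenant database.
async function runMigrations(db: string): Promise<void> {
  console.log(`migrating ${db}`);
}

async function main() {
  const boss = new PgBoss('postgres://user:pass@localhost/control');
  await boss.start();

  await boss.createQueue('db-migrations', { policy: 'stately' });

  // One job per target database; singletonKey shards the "one at a time" rule per db.
  for (const db of ['tenant-a', 'tenant-b']) {
    await boss.send('db-migrations', { db }, { singletonKey: db, retryLimit: 3 });
  }

  await boss.work<{ db: string }>('db-migrations', { batchSize: 1 }, async ([job]) => {
    await runMigrations(job.data.db);
  });
}

main().catch(console.error);
```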

fenos · Jul 13 '25

We ran into this problem :(

eloff · Aug 06 '25

I've added some updates to the linked issue at the top of the thread, #535. I think I'm at a place where I could stop work and push v11. There are some outstanding requests about enhancing how scheduling works, but in the interest of incremental delivery, it's probably time to just wrap it up and ship it to get feedback from you. It's a slightly different approach than what is proposed here in the PR, so I'm planning on closing this once v11 is out.

timgit · Sep 23 '25