[FEATURE REQ] Service Bus abandon message with custom delay

Open jsquire opened this issue 2 years ago • 60 comments

Issue Transfer

This issue has been transferred from the Azure SDK for .NET repository, #9473.

Please be aware that @Bnjmn83 is the author of the original issue and include them for any questions or replies.

Details

This is still a desired feature and totally makes sense in many scenarios. Is there any effort to implement this in the future?

Sometimes business logic decides that it would be good to retry a message at some later time. For this reason, it would be very helpful if Abandon(IDictionary<>) or similar were able to set ScheduledEnqueueTimeUtc.

msg.Abandon(new Dictionary<string, object>() { { "ScheduledEnqueueTimeUtc", DateTime.Now.ToUniversalTime().AddMinutes(2) } });

Right now, this does not work, because only custom properties can be manipulated this way. I’m also happy to hear whether this kind of retry can be achieved in some other, easier way. Right now, I typically set LockDuration to some reasonable retry time and avoid invoking abandon in PeekLock mode. Another way is deferring, but I don’t like it, because it requires me to track the deferred message, which makes things more complicated than necessary.

To recap, wouldn’t it be good to have something between Defer() and Abandon()? For example, Defer(TimeSpan) or Defer(ScheduledTimeUtc) or Abandon(TimeSpan). The only difference would be that, in the case of Defer, the DeliveryCount property wouldn’t be incremented.

Original Discussion


@msftbot[bot] commented on Tue Jan 14 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl


@jsquire commented on Tue Jan 14 2020

@nemakam and @binzywu: Would you be so kind as to offer your thoughts?


@nemakam commented on Tue Jan 14 2020

@Bnjmn83, This is a feature ask that we could work on in the future, but we don't have an ETA right now. As an alternative solution, you can implement this yourself on the client using the transaction feature. Essentially, complete() the message and send a new message with an appropriate "scheduleTime" within the same transaction. That should behave similarly.


@axisc commented on Thu Aug 13 2020

I think @nemakam's recommendation of completing the message and sending a scheduled message is a better approach.

Service Bus (or any message/command broker) is a server/sender side cursor. When a receiver/client wants to control when the message is visible again (i.e. custom delay/retry) it must take over the cursor from the sender. This can be achieved with the below options -

  • Completing the message and then resending with a scheduled message.
  • Deferring message and receiving explicitly.

Do let me know if this approach is too cumbersome and we can revisit. If not, I can close this issue.


@mack0196 commented on Wed Mar 31 2021

If the subscription\queue has messages in there, will the scheduled message 'jump to the front of the line' at its scheduled time?


@ramya-rao-a commented on Mon Nov 01 2021

@shankarsama Please consider moving this issue to https://github.com/Azure/azure-service-bus/issues where you track feature requests for the service

jsquire avatar Nov 01 '21 17:11 jsquire

This is such a critical, overdue feature. Having 'delivery counts' is useless without this, because anytime something fails, it just retries N times in rapid succession and dead-letters anyway. This extra processing just makes a bad situation N times worse. We need to be able to 'update' the message properties AND (more importantly) reschedule the original message to run with exponential backoff or whatever algorithm we want. We can control this by storing the original message time in the user properties collection, for example, and computing the next delay using the current delivery count. Using transactions to re-enqueue a new message while completing the existing one is not a good option; we should not have to resend the entire message payload. I would recommend just updating the AbandonAsync method to include overloads that accept an updated scheduled enqueue time in addition to the updated user properties.
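As a rough illustration of the back-off computation described above (a sketch only, not an existing Service Bus API): the function below derives a delay from the current delivery count. The name `backoff_delay`, the 5 second base and the 15 minute cap are assumptions chosen for the example, and it assumes the delivery count is 1 on the first delivery.

```python
from datetime import timedelta


def backoff_delay(delivery_count: int,
                  base: timedelta = timedelta(seconds=5),
                  cap: timedelta = timedelta(minutes=15)) -> timedelta:
    """Exponential back-off: base * 2**(delivery_count - 1), limited to `cap`."""
    if delivery_count < 1:
        raise ValueError("delivery_count is assumed to start at 1")
    # Clamp the exponent so very large counts cannot overflow timedelta.
    exponent = min(delivery_count - 1, 32)
    return min(base * (2 ** exponent), cap)


# Hypothetical usage: the delay a consumer would ask for on each failed delivery.
for count in (1, 2, 3, 10):
    print(count, backoff_delay(count))
```

With the requested feature, a value like this is what the consumer would pass along when abandoning; today it can only be applied through one of the workarounds discussed in this thread.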

A scheduled message really can't 'be in line' when it's scheduled. It's just at a theoretical point in time. When that point in time elapses, the message should just 'get in line' at that point in time (end of the line). The 'delay' is the more important functionality, not the specific time. Queued messages are queued and are delayed by nature.

triynko avatar Feb 02 '22 03:02 triynko

+1 for this please. Not much use in DDOS'ing our own services. Exponential back off policy would be a fantastic feature to add.

brian-duffy avatar Feb 13 '22 13:02 brian-duffy

One way to accomplish this already today is to use message deferral combined with a scheduled message. For this, you would defer the message, and place its sequence number in a scheduled message. When the scheduled message comes in, use the sequence number to retrieve the deferred message and process it. Please let us know if this works for you.

EldertGrootenboer avatar Feb 24 '22 21:02 EldertGrootenboer

@EldertGrootenboer I'm interpreting the original request as asking for built-in functionality to reduce the code complexity required for something that should be a simple message disposition. Whenever a workaround involves several operations, it not only incurs additional cost at the service level but also adds a cognitive tax and complexity to codebases. Walking through the workaround, here's what needs to happen:

  1. To ensure two different operations are atomic, message deferral and scheduling have to be done in a transaction.
  2. Sending a message out requires access to a message sender. If you're in a scenario/codebase where you don't have access to the object, you either have to create a sender (expensive) or rely on luck and skip atomicity altogether. Example: Azure Functions or a custom messaging framework.
  3. When receiving a scheduled message, you now have to proactively fetch an additional message, requiring a message receiver. Again, in quite a few scenarios, this is not possible. Think Azure Functions. You don't want to create a receiver to read the deferred message for each invocation.

To sum it up, there are scenarios where abandoning with a custom delay is necessary and workarounds cannot provide the same value a feature would. I hope this helps.

SeanFeldman avatar Feb 24 '22 22:02 SeanFeldman

Thank you for your feedback! Although this is not something that should be done with either Abandon or Defer, as it would change the semantics of those actions, it is something we want to look into putting on the backlog. I would like to align with you for this, to get the details for your scenario. @Bnjmn83, @triynko and @SeanFeldman could you drop me a message on [email protected], and we can take it from there.

EldertGrootenboer avatar Feb 25 '22 01:02 EldertGrootenboer

@RichardGaoF, abandoning is never about deferring. With regular abandon operation the message goes back to the queue and is available right away. With this feature, the ask is for the message to be delayed for the provided time span upon abandoning, and then become available automatically.

SeanFeldman avatar Apr 04 '22 14:04 SeanFeldman

One way to accomplish this already today is to use message deferral combined with a scheduled message.

For this, you would defer the message, and place its sequence number in a scheduled message.

When the scheduled message comes in, use the sequence number to retrieve the deferred message and process it.

Please let us know if this works for you.

Thanks @EldertGrootenboer. I have just one question: it seems the deferral time must be a fixed timespan set when scheduling the message? In other words, supposing the timespan is set to 10 minutes, does that mean (a) the message will be enqueued in 10 minutes (scheduled) and then also be deferred for 10 minutes each time the consumer retrieves it and checks some custom conditions, OR (b) the message will be enqueued in 10 minutes (scheduled) and then the consumer will not retrieve the message UNTIL some custom conditions are met (working like an event trigger)?

I am expecting the latter, but it looks like it's actually the former (only a fixed timespan can be set for a deferred message)?

RichardGaoF avatar Apr 04 '22 14:04 RichardGaoF

@RichardGaoF You don't set the timespan on the deferred message, but on the scheduled message instead. The deferred message will stay on the queue until it is explicitly retrieved using its sequence number.

The scheduled message will be enqueued after the timespan has elapsed, and will be placed at the back of the queue. Once it is picked up by a consumer, that consumer will then use the sequence number which was added to the scheduled message to retrieve the deferred message.
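The flow described above can be sketched with a small in-memory model. This is a toy stand-in for a queue, not the Service Bus SDK: `ToyQueue`, `retry_later` and the `retry:<sequence number>` body format are all hypothetical names made up for the illustration, and time is passed in explicitly as plain seconds.

```python
import heapq
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass
class Message:
    body: str
    sequence_number: int


class ToyQueue:
    """In-memory model: active messages, deferred messages, scheduled messages."""

    def __init__(self):
        self._seq = count(1)
        self.active = deque()
        self.deferred = {}     # sequence_number -> Message
        self._scheduled = []   # heap of (due_time, sequence_number, Message)

    def send(self, body):
        msg = Message(body, next(self._seq))
        self.active.append(msg)
        return msg

    def schedule(self, body, due_time):
        msg = Message(body, next(self._seq))
        heapq.heappush(self._scheduled, (due_time, msg.sequence_number, msg))
        return msg

    def receive(self, now):
        # Scheduled messages join the BACK of the active queue once they are due.
        while self._scheduled and self._scheduled[0][0] <= now:
            self.active.append(heapq.heappop(self._scheduled)[2])
        return self.active.popleft() if self.active else None

    def defer(self, msg):
        # A deferred message is set aside until asked for by sequence number.
        self.deferred[msg.sequence_number] = msg

    def receive_deferred(self, sequence_number):
        return self.deferred.pop(sequence_number)


def retry_later(queue, msg, now, delay):
    """Defer `msg` and schedule a small pointer message carrying its sequence number."""
    queue.defer(msg)
    queue.schedule(f"retry:{msg.sequence_number}", now + delay)


queue = ToyQueue()
queue.send("order-1")
queue.send("order-2")

first = queue.receive(now=0)                # order-1, processing fails
retry_later(queue, first, now=0, delay=30)
second = queue.receive(now=1)               # order-2 is not blocked
nothing = queue.receive(now=10)             # pointer is not due yet
pointer = queue.receive(now=30)             # pointer message arrives
original = queue.receive_deferred(int(pointer.body.split(":")[1]))
```

Note that the delay lives only on the pointer message; the deferred message itself carries no timer, which is why the consumer has to fetch it explicitly.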

EldertGrootenboer avatar Apr 04 '22 16:04 EldertGrootenboer

@RichardGaoF You don't set the timespan on the deferred message, but on the scheduled message instead. The deferred message will stay on the queue until it is explicitly retrieved using its sequence number.

The scheduled message will be enqueued after the timespan has elapsed, and will be placed at the back of the queue. Once it is picked up by a consumer, that consumer will then use the sequence number which was added to the scheduled message to retrieve the deferred message.

Thank you @EldertGrootenboer. So if I implement this in a loop, the deferred message will always stay in the queue (set aside) during the looping until it is received, handled and completed, and each iteration needs a newly created scheduled message with two properties: its timespan parameter works like the loop interval, and its message ID should always be the sequence number of the deferred message. Correct?

RichardGaoF avatar Apr 05 '22 09:04 RichardGaoF

@RichardGaoF you’re confusing message deferral and abandoning with a time-out. With this feature you don’t need to use message’s sequence number. The message won’t change its ID or anything else besides DeliveryCount because it will be the same message. Have a look at how abandoning works and add to that a back-off time that would be added. That’s it.

SeanFeldman avatar Apr 05 '22 13:04 SeanFeldman

@SeanFeldman @EldertGrootenboer Thanks. I may know each of the concepts individually (peek-lock, abandon, lock expiry, DLQ, TTL expiry, scheduled messages, deferred messages ...), but there seems to be no article on the Internet (including MSDN) that clearly describes how they all work together. Maybe it is implied that they never work together, but if that is not stated, readers don't know or at least are not sure, which is my current situation. Anyway, please allow me to try to describe the following typical scenario using all of these concepts together.

First, without the scheduled message and deferred message concepts

We have just a "general" queue. The queue itself has a TTL value, which means a message will be moved to the DLQ if it has not been consumed when the TTL expires. In peek-lock mode, when a consumer polls, the queue locks the next message and sends it to the consumer. If the consumer cannot process this message and abandons it, or the processing time exceeds the lock timeout, the queue unlocks the message so that it is visible to all consumers again. There is also a max delivery count, and the message will be moved to the DLQ if it exceeds that count.

Now let's involve the scheduled message and deferred message concepts

[Image: scheduled and deferred messages]

  1. There are a to-be-scheduled message and a to-be-deferred message, and the to-be-scheduled message's ID is the to-be-deferred message's sequence number. Schedule the to-be-scheduled message with a timespan and defer the to-be-deferred message.

  2. The scheduled message will not be enqueued until the timespan has elapsed.

  3. A consumer polls, then the queue locks and delivers the scheduled message to the consumer. The consumer uses the scheduled message's ID (which is the sequence number of the deferred message) to retrieve the deferred message and TRIES to process it.

  4. Here are the QUESTIONS: If the deferred message is NOT ready to be processed (or, say, 'failed to be processed'), will the consumer (4-1) 1) create a new scheduled message; 2) schedule/enqueue the new message; 3) defer a new copy of the deferred message; 4) complete the original deferred message; 5) complete the original scheduled message? OR (4-2) abandon the deferred message so that it is visible in the queue again?

  5. If the former (4-1), the deferred message will never exceed the max delivery count and be moved to the DLQ (actually, each deferred message will be delivered only once). If the latter (4-2), once the number of abandons exceeds the max delivery count, the deferred message will be moved to the DLQ, but there will never be new scheduled messages or new copies of the deferred message.

Which of the above is the real behavior of message deferral?

I personally lean towards the former, but I am not sure, because the MSDN doc lacks details and examples, and this article with an example seems to confuse scheduled messages and deferred messages.

RichardGaoF avatar Apr 07 '22 04:04 RichardGaoF

@RichardGaoF, there's no deferral for this feature. Plain and simple. This issue is talking about the ability to abandon a message and specify a timeout. When a message is abandoned today, it goes back to the queue and is available for processing right away if there are no other messages in the queue. What this issue is about is adding a delay to an abandoned message, so that rather than appearing immediately, it would be delayed. It's the same message. No need to create a new message, no need to defer and hold on to a message sequence number, none of that.

The delivery count and dead-lettering would continue to work exactly the same way because the message is the same message.

If this still doesn't answer your question, I suggest moving the discussion to email.

SeanFeldman avatar Apr 07 '22 06:04 SeanFeldman

@SeanFeldman I re-read the whole conversation thread to better understand the context of the issue.

Yes, delaying an abandoned message before it becomes visible again in the queue is not provided by any out-of-the-box (OOB) Azure Service Bus feature now, so the method @EldertGrootenboer recommended (message scheduling + deferral) can be understood as a workaround while no OOB feature can be used directly, but with the shortcoming that it is a one-time operation/deferral instead of a "do-deferral-while" operation. So with this one-time operation/deferral, just like you said, the delivery count and DLQ work normally if we abandon the deferred message in our consumer.

On the other hand, as in the business logic I am currently facing, a typical business scenario is to keep deferring a message until some condition(s) are met (do-deferral-while), instead of deferring a message only once. Therefore, some people have implemented such do-deferral-while behavior by creating a new scheduled message and a new deferred message in a loop to re-enqueue, for example, one I found on the Internet.

In short, referring to my last post: with 4-1, there is no abandon and no exceeding of the max delivery count at all, just a loop creating a new scheduled message and a new deferred message to re-enqueue, to realize the do-deferral-while logic; with 4-2, after a one-time deferral using a scheduled message and a message deferral, the abandoned message will be visible in the queue again immediately.

Thanks for the invitation; I might join your email discussion if I run into more problems when implementing my business logic.

RichardGaoF avatar Apr 07 '22 08:04 RichardGaoF

One scenario where this request from @SeanFeldman would be useful is when you have sessions enabled and want to implement a circuit breaker on top. For example, I have multiple projects that send messages to a queue, session id is the customer id, and messages for the same customer need to be processed in order. But if there's a failure in processing one of the messages for a customer, requiring some manual intervention, a separate notification/workflow can be kicked off for manual investigation (say, a product is missing and needs to be created), and then reprocessing of the messages can continue. Being able to Abandon with a delay would be helpful so that the specific session/customer 'pauses' processing while the issue is addressed, and the in-order requirement is not broken. It would at least make the solution to that requirement simpler, I suspect.

nzthiago avatar Jul 01 '22 16:07 nzthiago

A feature like @SeanFeldman proposes would definitely simplify many solutions. I would propose to additionally have a delay on deferral to let the message go back to normal queue after delay time. This way we don't need to keep track of sequence number in all cases. Of course there are some things to think about like TTL of the message when returned to the queue.

The reason to have both is that I would like to differentiate between an exception (e.g. a resource not available) and "I want to handle this later" (e.g. ordered processing). Abandon would raise the delivery count while defer would not.

skastberg avatar Jul 03 '22 06:07 skastberg

We have put this on our backlog; thank you to everyone who gave their input. There are no implementation details or timelines to share yet, but we will update this thread as we progress.

EldertGrootenboer avatar Jul 21 '22 17:07 EldertGrootenboer

It's been more than a year; any updates on this? Just checking if it is still part of the backlog.

abhishek-msft avatar Oct 25 '22 15:10 abhishek-msft

This is indeed in the backlog, and we are currently creating a design for this. There are no timelines yet to share, but we will update this issue when we have more information.

EldertGrootenboer avatar Oct 26 '22 11:10 EldertGrootenboer

So happy this is in the backlog and in design phase. Basically, when we fail to process a message, it's because of some transient error. Maybe a database is unavailable, or some async action it depends on having completed hasn't yet completed. So we abandon the message.

The problem is that it gets picked up right away, fails again, we abandon it, then we repeat this N times based on max delivery count. Because there's virtually no delay between retries, there's no time for the transient error condition to resolve itself, so we're basically DOSing our own system unnecessarily, and after N deliveries, the message deadletters anyway. All is lost.
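To make the timing argument concrete, here is a toy timeline calculation (not SDK code). It assumes a dependency that is down for 60 seconds, a max delivery count of 10 and 0.1 seconds per failed attempt; the function name and all of these numbers are made up for the illustration.

```python
def deliveries_until_success(outage_seconds, max_delivery_count,
                             delay_after, attempt_seconds=0.1):
    """Return (delivery_count, start_time) of the first attempt that starts
    after the outage has ended, or (None, elapsed) if the message would be
    dead-lettered first."""
    clock = 0.0
    for delivery_count in range(1, max_delivery_count + 1):
        if clock >= outage_seconds:
            return delivery_count, clock
        # The attempt fails; time passes for the attempt plus any delay
        # before the message becomes visible again.
        clock += attempt_seconds + delay_after(delivery_count)
    return None, clock


# Abandon with immediate redelivery: no delay between attempts.
immediate = deliveries_until_success(60, 10, lambda n: 0.0)

# Abandon with a delay that doubles on every delivery, starting at 5 seconds.
backed_off = deliveries_until_success(60, 10, lambda n: 5.0 * 2 ** (n - 1))

print("immediate:", immediate)
print("backed off:", backed_off)
```

In this model the immediate variant uses up all ten deliveries in about a second and ends in the dead-letter queue, while the delayed variant is still within its delivery budget when the dependency comes back.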

All we want is ability to abandon a message and specify some delay (or scheduled future date) before it will be picked up by subscribed processors again, which we can compute ourselves as some exponential back-off based on the current delivery count. Semantically, introducing this delay in the Abandon call makes the most sense to us. We want to release the lock on the message, but we want a delay introduced before it gets picked up for processing again automatically.

The workaround of just rescheduling the message is a bad idea for a few reasons. A completely new message resets delivery count to zero. So we'd have to create/track our own delivery count. This also increases delivery size. We also need metadata for the retry, like 'which pieces of processing failed'. For example, we have 'subscriptions' attached to handlers (these subscribers represent downstream systems that need notified that the message has arrived), so if 2 of 3 subscribers fail to be notified about the message, we have to embed these failed subscriptions in the rescheduled message and retry processing. That's a problem because we risk increasing the original message size with this property, and risk failure to reschedule. Tracking delivery count on our own is also a bad idea, because if we fail to update our internal count and the lock times out, we lose a count. So we'd have to use a SUM of the Azure-managed DeliveryCount + our InternalDeliveryCount. So there are 3 problems there.

  • Entire messages being re-sent to a queue for each processing retry, which puts extra data load on a queue where we're already struggling to meet send SLAs.
  • We also must track delivery count on our own, because re-scheduling a message creates a new message that resets the count to zero.
  • We are also risking increasing size and overflowing max message size on retry by adding these extra properties after the fact, leaving us no option but to dead letter.
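The delivery-count bookkeeping described above could look roughly like this. It is a sketch of the complete-and-reschedule workaround only; the property name `carried-delivery-count` and both function names are hypothetical, and the dictionaries stand in for a message's user (application) properties.

```python
CARRIED_COUNT = "carried-delivery-count"  # hypothetical property name


def effective_delivery_count(properties: dict, delivery_count: int) -> int:
    """Broker-managed count of this message plus the count carried over
    from the messages it replaced."""
    return int(properties.get(CARRIED_COUNT, 0)) + delivery_count


def next_retry_properties(properties: dict, delivery_count: int) -> dict:
    """Properties for the replacement message that is scheduled when the
    current message is completed for a later retry."""
    updated = dict(properties)  # do not mutate the received message's properties
    updated[CARRIED_COUNT] = effective_delivery_count(properties, delivery_count)
    return updated


# Hypothetical usage: the original message failed on its 2nd delivery ...
replacement = next_retry_properties({"tenant": "a"}, delivery_count=2)
# ... and the replacement is now on its 1st delivery.
total = effective_delivery_count(replacement, delivery_count=1)
print(replacement, total)
```

Every retry adds or rewrites this property on the resent message, which is exactly the extra payload and tracking burden the list above complains about.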

Now, the workaround that EldertGrootenboer came up with to defer the original message and submit a scheduled message with just the sequence number of the original message solves most, but not all, of those problems:

  • We no longer have to re-send the message data; deferring the original message leaves the original message intact
  • The smaller scheduled message with the sequence number allows us to implement the processing delay
  • We no longer need to track delivery count on our own; the original message is intact
  • We no longer have to embed metadata in the original message on abandon; the smaller scheduled message with the sequence number only can also carry the failed subscription identifiers for the next processing retry
  • We no longer have to worry about overflowing the message size; we're not adding any new metadata to it on abandon; it's all stored in the smaller schedule message that holds only identifiers

It also creates a new problem. We now have these 'deferred' messages, which are harder to work with, plus these extra smaller scheduled messages, which artificially increase the message counts in our queues and mess with alarm thresholds. There's also a risk of leaving a message deferred indefinitely if something goes wrong processing the scheduled message that holds the identifier. It's all just unnecessary complexity that wouldn't be necessary if this simple and obvious feature were implemented.

Of course ALL of this would be solved by the requested feature here. When we pick up a message and processing fails because of a transient error, we can call Abandon and just supply a delay so the message is scheduled in a way where it's not picked up by subscribed processors until after some delay, rather than immediately. (Note the use of the term 'subscribed processors' here is different from the 'subscribers' I mentioned earlier; our 'subscribers' represent downstream systems that need notified about a message being processed).

triynko avatar Jan 19 '23 20:01 triynko

I have been using some workarounds for this issue, and I just figured out there is an undesired side-effect: if one is using topics/subscriptions, then sending a new message to the topic when there is a failure in processing results in that all of the subscriptions will receive it which is quite unfortunate.

I really hope that this feature will be implemented soon as it is long overdue. Is there some estimate on when we can expect a possibility to delay the message processing without re-sending?

ilya-scale avatar Feb 23 '23 09:02 ilya-scale

@ilya-scale I've done the same workaround and found the same side-effect. My workaround creates an "additional" header in the new message with the name of the subscription that triggered the "deferral", so the other subscriptions look at this header and just ignore the message if it is not addressed to them... something like a wireless protocol.
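A minimal sketch of that addressing check, assuming the header is a user property on the resent message. The header name `retry-target-subscription` and the function name are hypothetical, and the dictionaries stand in for a message's user properties.

```python
TARGET_HEADER = "retry-target-subscription"  # hypothetical header name


def should_process(properties: dict, my_subscription: str) -> bool:
    """Process the message unless it is a retry addressed to another subscription."""
    target = properties.get(TARGET_HEADER)
    # Ordinary messages carry no header and are meant for every subscription.
    return target is None or target == my_subscription


# Hypothetical usage from the point of view of the "billing" subscription.
print(should_process({}, "billing"))
print(should_process({TARGET_HEADER: "billing"}, "billing"))
print(should_process({TARGET_HEADER: "shipping"}, "billing"))
```

Subscriptions that skip a message this way would still need to complete it, so every retry is still delivered to, and settled by, all subscriptions.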

maxandriani avatar Feb 27 '23 14:02 maxandriani

This is indeed in the backlog, and we are currently creating a design for this. There are no timelines yet to share, but we will update this issue when we have more information.

How did the designing go? Did you run into any issues? Curious for an update!

Zenuka avatar Mar 14 '23 07:03 Zenuka

Well put by @triynko :

The problem is that it gets picked up right away, fails again, we abandon it, then we repeat this N times based on max delivery count. Because there's virtually no delay between retries, there's no time for the transient error condition to resolve itself, so we're basically DOSing our own system unnecessarily, and after N deliveries, the message deadletters anyway. All is lost.

This is even more important with the Azure Function trigger. In case of failure, the same function is retriggered immediately.

Looking forward to a fix sooner for this :)

jatinpuri-microsoft avatar Apr 13 '23 10:04 jatinpuri-microsoft

Well put by @triynko :

The problem is that it gets picked up right away, fails again, we abandon it, then we repeat this N times based on max delivery count. Because there's virtually no delay between retries, there's no time for the transient error condition to resolve itself, so we're basically DOSing our own system unnecessarily, and after N deliveries, the message deadletters anyway. All is lost.

This is even more important with the Azure Function trigger. In case of failure, the same function is retriggered immediately.

That's a problem we are facing right now, there are some possible workarounds like catching the exception and throwing it after a delay but that's far from ideal. Hopefully this will soon be included in the servicebus ;)

wouter-b avatar Apr 13 '23 12:04 wouter-b

When using a scheduled message to act as a pointer to a deferred message, where are people sending the scheduled message? In our use case we're dealing with topics, and sending a scheduled message back to the topic would impact all subscriptions (without using subscription filters).

So this leaves us having to set up a dedicated queue for scheduled messages acting as deferred message pointers. But then when you have multiple applications subscribing to a topic, each application needs its own queue to handle scheduling of deferred messages.

It would be great if this could be handled internally by Service Bus. Maybe, as an extension to Complete & Abandon, there could be a DeferUntil(TimeSpan)?

philip-reed avatar Apr 18 '23 11:04 philip-reed

This is indeed in the backlog, and we are currently creating a design for this. There are no timelines yet to share, but we will update this issue when we have more information.

We are facing the same challenge as was previously described by @nzthiago, having a service bus triggered function with sessions enabled where we need to delay (backoff/circuit-break) the repetitive execution of a whole sequence of messages in a single session when the server is busy or under maintenance.

As the feature request was backlogged almost a year ago, any update or timeline would be much appreciated. Thanks in advance.

pgorbar avatar Jul 06 '23 13:07 pgorbar

Any updates on this? It has now been nearly 2 years since you got this request. It seems so easy to just add a delay when we abandon a message. We introduced Service Bus as a means of decoupling message transfers to systems that are not 100% available, so sometimes they just fail, and we want to delay by some minutes or hours when backends aren't available. We now see that Azure Service Bus was the wrong design choice, as it is not improving, even for basic features that one would expect from such a system. If nothing happens, we may need to switch to other solutions. So we would need some help from your side to avoid this.

So what is the conceptual challenge in providing a delay when abandoning a message that would take over a year to think over? Can you please describe where the current design process is stuck and what challenges you cannot solve yet? Maybe the community can help.

larsilus avatar Jul 19 '23 10:07 larsilus

One way to accomplish this already today is to use message deferral combined with a scheduled message. For this, you would defer the message, and place its sequence number in a scheduled message. When the scheduled message comes in, use the sequence number to retrieve the deferred message and process it. Please let us know if this works for you.

I think this is something that can be used to work-around with. However, it adds some additional administration and logic / plumbing to implement this.

The easiest way to implement this feature would indeed be an AbandonAsync method overload that allows you to specify a scheduledEnqueueTimeUtc value. I've just tried this, but this doesn't seem to work as the updated properties you can pass in to the AbandonAsync method are seen as 'custom message properties' instead.

fgheysels avatar Jul 20 '23 09:07 fgheysels

Providing a quick update on this. The design for this is ready, once this is picked up for development we will provide another update on this issue.

EldertGrootenboer avatar Jul 21 '23 16:07 EldertGrootenboer

Really glad to hear this is being worked on. I started working with Service Bus for the first time this week and found myself in need of this feature almost immediately. Can't wait for it to come out!

ultrabstrong avatar Aug 08 '23 19:08 ultrabstrong