azure-sdk-for-net icon indicating copy to clipboard operation
azure-sdk-for-net copied to clipboard

[BUG] Session MaxAutoLockRenewalDuration and SessionIdleTimeout values causing queue starvation

Open bramley-jetcharge opened this issue 1 year ago • 7 comments
trafficstars

Library name and version

Azure.Messaging.ServiceBus 7.17.2

Describe the bug

I have two separate but likely related issues with Azure.Messaging.ServiceBus 7.17.2 I am sending a simple $"hello {i}" message every x seconds to a session-enabled queue, using a fixed sessionId (eg "egg" or "spoon"). The queue's lock duration is set to 1 minute. Receiver code snippets below (or see gist):

var options = new ServiceBusSessionProcessorOptions
{
    AutoCompleteMessages = true,
    MaxConcurrentSessions = 1000,
    MaxConcurrentCallsPerSession = 2,
     MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(6),
    SessionIdleTimeout = TimeSpan.FromMinutes(2),
};
// create a session processor that we can use to process the messages
await using ServiceBusSessionProcessor processor = client.CreateSessionProcessor(queueName, options);

// configure the message and error event handler to use - these event handlers are required
processor.ProcessMessageAsync += MessageHandler;
processor.SessionInitializingAsync += SessionInitializingHandler;
processor.SessionClosingAsync += SessionClosingHandler;

async Task MessageHandler(ProcessSessionMessageEventArgs args)
{
    var now = DateTime.UtcNow;
    var body = args.Message.Body.ToString();
    var lag = now - args.Message.EnqueuedTime;
    var lockDuration = args.SessionLockedUntil - now;
    await Task.CompletedTask;
    Console.WriteLine($"{now}: {args.SessionId}: {body}. Lag: {lag.TotalSeconds}s. Lock duration: {lockDuration.TotalSeconds}s");
}

Expected behavior

Issue 1

Setup

MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(6), SessionIdleTimeout = TimeSpan.FromMinutes(2), and sending messages every 31 seconds

Expected Behaviour

  • The session should close after 6 minutes (and then reopen on a new session handler).
  • SessionLockedUntil should always be in the future

Issue 2

Setup

MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(10), SessionIdleTimeout = TimeSpan.FromMinutes(2), and sending messages every 110 seconds

Expected Behaviour

  • Messages are consistently received on the same session handler, without delay, for at least 10 minutes

Actual behavior

Issue 1

Actual Behaviour

After 6+ minutes, messages continue to be received on the same handler. The ProcessSessionMessageEventArgs.SessionLockedUntil time becomes negative. I left this run for an hour, the lock time never increased.

MaxAutoLockRenewalDuration is "The maximum duration during which session locks are automatically renewed". Considering I have no code to explicitly renew a lock, shouldn't it expire after 6 minutes, triggering SessionClosingAsync, then SessionInitializingAsync (either on this instance or another, when running multiple instances)? I also found that if I stop sending messages, the session closes after 2 minutes, as I expect based on SessionIdleTimeout.

Issue 2

Actual Behaviour

Msg 0 is received, creating the session. Msg 1 is received 110s later. At 3:27:00 Msg 2 is sent, but is not yet received. at 3:27:10, the session is closed, then a new one opened, and then Msg 2 is received, 10 seconds late.

15/02/2024 3:23:20 AM: SessionInitializingHandler egg
15/02/2024 3:23:20 AM: egg: Msg 0. Lag: 0.5027955s. Lock duration: 59.5133624s
15/02/2024 3:25:10 AM: egg: Msg 1. Lag: 0.4167196s. Lock duration: 48.8482804s
15/02/2024 3:27:10 AM: SessionClosingHandler egg
15/02/2024 3:27:10 AM: SessionInitializingHandler egg
15/02/2024 3:27:10 AM: egg: Msg 2. Lag: 10.5118409s. Lock duration: 59.5833451s
15/02/2024 3:28:50 AM: egg: Msg 3. Lag: 0.40434s. Lock duration: 58.93966s

This behaviour repeats, with session switching and 10s lag every third message.

I then tried a few other combinations of send interval and SessionIdleTimeout:

  • send every 55s, SessionIdleTimeout 120s: Works correctly, no lag.
  • send every 65s, SessionIdleTimeout 120s: first msgs 0 and 1, 65s, received. Msg 2 received 120s after msg 1, following SessionClosingHandler then SessionInitializingHandler. Meaning the msg is received 55s late. This particular case did eventually stabilise, however with a negative lock duration
  • send every 55s, SessionIdleTimeout 60s: first msgs 0 and 1, 55s, received. Msg 2 received 60s after msg 1, following SessionClosingHandler then SessionInitializingHandler. Meaning the msg is received 5s late.
  • send every 130s, SessionIdleTimeout 4 mins: first msgs 0 and 1, 130s, received. Msg 2 received 4min after msg 1, following SessionClosingHandler then SessionInitializingHandler. Meaning the msg is received 110s late.

From this I conclude that the session receiver correctly holds the session lock until last successful receive time plus SessionIdleTimeout, but stops actually invoking ProcessMessageAsync if msg is received less frequently than half the SessionIdleTimeout value. Depending on the interval and timeout value, holding the lock without processing pending messages can lead to significant lag/starvation of the queue.

Reproduction Steps

Gist adapted from this sample: https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/servicebus/Azure.Messaging.ServiceBus/samples/Sample05_SessionProcessor.md : https://gist.github.com/bramley-jetcharge/2b0466fee777c3ea49beb620ee76c533

Environment

.NET SDK (reflecting any global.json): Version: 6.0.400 Commit: 7771abd614

Runtime Environment: OS Name: Windows OS Version: 10.0.19045 OS Platform: Windows RID: win10-x64 Base Path: C:\Program Files\dotnet\sdk\6.0.400\

bramley-jetcharge avatar Feb 16 '24 03:02 bramley-jetcharge

Thank you for your feedback. Tagging and routing to the team member best able to assist.

jsquire avatar Feb 16 '24 14:02 jsquire

@bramley-jetcharge, there was a bug in 7.17.2 that may be related to the issues you are facing. Can you try to repro on 7.17.3?

JoshLove-msft avatar Feb 16 '24 15:02 JoshLove-msft

Hi @bramley-jetcharge. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

github-actions[bot] avatar Feb 16 '24 15:02 github-actions[bot]

Thanks, looks like you released 7.17.3 right after I started testing! It did fix issue 2, but I am still seeing issue 1: MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(6), SessionIdleTimeout = TimeSpan.FromMinutes(4), and sending messages every 31 seconds

After 6+ minutes, messages continue to be received on the same handler. The ProcessSessionMessageEventArgs.SessionLockedUntil time becomes negative. I left this run for an hour, the lock time never increased.

sample output:

17/02/2024 2:45:40 AM: SessionInitializingHandler egg
17/02/2024 2:45:40 AM: egg: Msg 0. Lag: 1.8969068s. Lock duration: 58.1191186s
17/02/2024 2:46:11 AM: egg: Msg 1. Lag: 1.6884163s. Lock duration: 27.2616091s
17/02/2024 2:46:42 AM: egg: Msg 2. Lag: 1.6716996s. Lock duration: 44.7163004s
17/02/2024 2:47:13 AM: egg: Msg 3. Lag: 1.6744682s. Lock duration: 13.6685318s
17/02/2024 2:47:44 AM: egg: Msg 4. Lag: 1.7002465s. Lock duration: 30.9367535s
17/02/2024 2:48:15 AM: egg: Msg 5. Lag: 1.6760181s. Lock duration: 48.2279819s
17/02/2024 2:48:46 AM: egg: Msg 6. Lag: 1.6731483s. Lock duration: 17.1618517s
17/02/2024 2:49:17 AM: egg: Msg 7. Lag: 1.6935753s. Lock duration: 34.5224247s
17/02/2024 2:49:48 AM: egg: Msg 8. Lag: 1.6969981s. Lock duration: 51.8180019s
17/02/2024 2:50:19 AM: egg: Msg 9. Lag: 1.6727394s. Lock duration: 20.7732606s
17/02/2024 2:50:50 AM: egg: Msg 10. Lag: 1.6940355s. Lock duration: 38.1409645s
17/02/2024 2:51:21 AM: egg: Msg 11. Lag: 1.7091079s. Lock duration: 55.4148921s
17/02/2024 2:51:52 AM: egg: Msg 12. Lag: 1.6725705s. Lock duration: 24.3744295s
17/02/2024 2:52:23 AM: egg: Msg 13. Lag: 1.6772244s. Lock duration: -6.6662244s
17/02/2024 2:52:55 AM: egg: Msg 14. Lag: 1.682253s. Lock duration: -37.721253s
17/02/2024 2:53:26 AM: egg: Msg 15. Lag: 1.6765933s. Lock duration: -68.7735933s
17/02/2024 2:53:57 AM: egg: Msg 16. Lag: 1.6731629s. Lock duration: -99.8221629s
17/02/2024 2:54:28 AM: egg: Msg 17. Lag: 1.6731869s. Lock duration: -130.8631869s
17/02/2024 2:54:59 AM: egg: Msg 18. Lag: 1.6877771s. Lock duration: -161.9017771s
17/02/2024 2:55:30 AM: egg: Msg 19. Lag: 1.6800192s. Lock duration: -192.9430192s
17/02/2024 2:56:01 AM: egg: Msg 20. Lag: 1.6842333s. Lock duration: -223.9862333s
17/02/2024 2:56:32 AM: egg: Msg 21. Lag: 1.7100187s. Lock duration: -255.0480187s
17/02/2024 2:57:03 AM: egg: Msg 22. Lag: 1.6784764s. Lock duration: -286.1474764s
17/02/2024 2:57:34 AM: egg: Msg 23. Lag: 1.6831129s. Lock duration: -317.1941129s
17/02/2024 2:58:05 AM: egg: Msg 24. Lag: 1.6806557s. Lock duration: -348.2446557s
17/02/2024 2:58:36 AM: egg: Msg 25. Lag: 1.6857493s. Lock duration: -379.2857493s
17/02/2024 2:59:07 AM: egg: Msg 26. Lag: 1.6888228s. Lock duration: -410.3358228s
17/02/2024 2:59:38 AM: egg: Msg 27. Lag: 1.6901601s. Lock duration: -441.3761601s
17/02/2024 3:00:09 AM: egg: Msg 28. Lag: 1.6780946s. Lock duration: -472.4340946s
17/02/2024 3:00:40 AM: egg: Msg 29. Lag: 1.6834788s. Lock duration: -503.4764788s
17/02/2024 3:01:11 AM: egg: Msg 30. Lag: 1.6848989s. Lock duration: -534.5108989s
17/02/2024 3:01:42 AM: egg: Msg 31. Lag: 1.6769508s. Lock duration: -565.5599508s
17/02/2024 3:02:13 AM: egg: Msg 32. Lag: 1.7033479s. Lock duration: -596.6203479s
17/02/2024 3:02:44 AM: egg: Msg 33. Lag: 1.6845338s. Lock duration: -627.6695338s
17/02/2024 3:03:16 AM: egg: Msg 34. Lag: 1.6864937s. Lock duration: -658.7074937s
17/02/2024 3:03:47 AM: egg: Msg 35. Lag: 1.6874179s. Lock duration: -689.7534179s
17/02/2024 3:04:18 AM: egg: Msg 36. Lag: 1.692409s. Lock duration: -720.808409s
17/02/2024 3:04:49 AM: egg: Msg 37. Lag: 1.6867796s. Lock duration: -751.8567796s
17/02/2024 3:05:20 AM: egg: Msg 38. Lag: 1.685359s. Lock duration: -782.911359s
17/02/2024 3:05:51 AM: egg: Msg 39. Lag: 1.6918786s. Lock duration: -813.9578786s
17/02/2024 3:06:22 AM: egg: Msg 40. Lag: 1.7100995s. Lock duration: -845.0340995s
17/02/2024 3:06:53 AM: egg: Msg 41. Lag: 1.6885905s. Lock duration: -876.0795905s
17/02/2024 3:07:24 AM: egg: Msg 42. Lag: 1.6984153s. Lock duration: -907.1304153s
17/02/2024 3:07:55 AM: egg: Msg 43. Lag: 1.7071609s. Lock duration: -938.2871609s
17/02/2024 3:08:26 AM: egg: Msg 44. Lag: 1.6953873s. Lock duration: -969.3753873s
17/02/2024 3:08:57 AM: egg: Msg 45. Lag: 1.690651s. Lock duration: -1000.441651s
17/02/2024 3:09:28 AM: egg: Msg 46. Lag: 1.6910603s. Lock duration: -1031.4970603s
17/02/2024 3:09:59 AM: egg: Msg 47. Lag: 1.6923991s. Lock duration: -1062.5333991s
17/02/2024 3:10:30 AM: egg: Msg 48. Lag: 1.68681s. Lock duration: -1093.59881s
17/02/2024 3:11:01 AM: egg: Msg 49. Lag: 1.6951195s. Lock duration: -1124.6461195s
17/02/2024 3:11:33 AM: egg: Msg 50. Lag: 1.6816297s. Lock duration: -1155.6886297s
17/02/2024 3:12:04 AM: egg: Msg 51. Lag: 1.7438724s. Lock duration: -1186.7878724s
17/02/2024 3:12:35 AM: egg: Msg 52. Lag: 1.6974936s. Lock duration: -1217.7934936s
...
17/02/2024 3:36:58 AM: egg: Msg 99. Lag: 1.7057607s. Lock duration: -2680.9357607s
//sender stopped
17/02/2024 3:40:58 AM: SessionClosingHandler egg

bramley-jetcharge avatar Feb 17 '24 03:02 bramley-jetcharge

After 6+ minutes, messages continue to be received on the same handler. The ProcessSessionMessageEventArgs.SessionLockedUntil time becomes negative.

I think you are saying that the SessionLockedUntil time ends up being in the past yet you are still receiving messages for the session. The service allows the session to remain locked as long as the session is being actively received from even if the lockedUntil time has past. So this is expected behavior. If you want the session to close early you can use the ReleaseSession method.

JoshLove-msft avatar Feb 17 '24 04:02 JoshLove-msft

Hi @bramley-jetcharge. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

github-actions[bot] avatar Feb 17 '24 04:02 github-actions[bot]

/unresolve

Hi, thanks for the info. So a session will remain locked so long as my application is running and I haven't released it, regardless of MaxAutoLockRenewalDuration value? In that case, what is the purpose of this setting? I wrote my code initially just to understand the difference between this and SessionIdleTimeout, and how I should configure them for my scenario:

I will have several thousand IoT devices, each sending frequent msgs to their own session. I will have several receiver instances, each with MaxConcurrentSessions = 1000, MaxAutoLockRenewalDuration = Timeout.Infinite (?), SessionIdleTimeout = TimeSpan.FromMinutes(60). The intention is for the processor to be locked to a session indefinitely (days or weeks), until either it stops sending msgs (>60 mins), or I explicitly ReleaseSession() it. I want to ensure msgs are always processed promptly, with no lag or starvation.

Do you see any issues with this approach?

Thanks very much

bramley-jetcharge avatar Feb 19 '24 01:02 bramley-jetcharge

Hi, thanks for the info. So a session will remain locked so long as my application is running and I haven't released it, regardless of MaxAutoLockRenewalDuration value? In that case, what is the purpose of this setting?

The behavior is up to the service implementation of the feature, but essentially this is what happens. The setting is useful in the case that your message processing takes more time than the actual receiving of messages. If that is your scenario, you wouldn't be actively receiving during the time that you are processing the message (i.e. after the message has already been received but before the processor attempts to receive the next message).

I will have several thousand IoT devices, each sending frequent msgs to their own session. I will have several receiver instances, each with MaxConcurrentSessions = 1000, MaxAutoLockRenewalDuration = Timeout.Infinite (?), SessionIdleTimeout = TimeSpan.FromMinutes(60). The intention is for the processor to be locked to a session indefinitely (days or weeks), until either it stops sending msgs (>60 mins), or I explicitly ReleaseSession() it. I want to ensure msgs are always processed promptly, with no lag or starvation.

A MaxConcurrentSessions value of 1000 may lead to thread starvation. I'd suggest starting with a lower value and then tuning to reach desired throughput/queue message latency. Other than that, I think this approach makes sense.

JoshLove-msft avatar Feb 25 '24 18:02 JoshLove-msft

Hi @bramley-jetcharge. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

github-actions[bot] avatar Feb 25 '24 18:02 github-actions[bot]

Hi @bramley-jetcharge, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

github-actions[bot] avatar Mar 03 '24 21:03 github-actions[bot]