
[QUERY] Service Bus client 7.15: sessionIdleTimeout and maxAutoLockRenewDuration meaning

Open lidkowiak opened this issue 7 months ago • 7 comments

Query/Question: We have the following scenario:

  • processing messages with sessions (session per tenant)
  • limit number of concurrent sessions per virtual machine/JVM
  • within session, process only one message to preserve ordering

At the moment we use the sync code listed in https://github.com/Azure/azure-sdk-for-java/issues/35197. It fulfils the requirements, but we are facing hanging clients that require a restart of the process, which is a nightmare.

We're thinking about switching to version 7.15 with the maxConcurrentSessions config, but we are facing an issue with sessionIdleTimeout and possibly maxAutoLockRenewDuration, as illustrated by the following logs.

2023-11-16 11:00:57,445 INFO  [com.company.servicebus.MessageHandler] (receiver-1-29) Entity TOPIC_NAME SUB_1 messageId 54d7447fd91a47ea982090dc7d8b460f Session PMS START
2023-11-16 11:00:57,445 INFO  [com.company.servicebus.MessageHandler] (receiver-5-28) Entity TOPIC_NAME SUB_2 messageId 54d7447fd91a47ea982090dc7d8b460f Session PMS START
2023-11-16 11:00:58,318 INFO  [com.company.servicebus.MessageHandler] (receiver-1-29) Entity TOPIC_NAME SUB_1 messageId 54d7447fd91a47ea982090dc7d8b460f Session PMS FINISH
2023-11-16 11:01:40,965 INFO  [com.company.servicebus.MessageHandler] (receiver-5-28) Entity TOPIC_NAME SUB_2 messageId 54d7447fd91a47ea982090dc7d8b460f Session PMS FINISH
2023-11-16 11:01:57,446 INFO  [com.azure.messaging.servicebus.ServiceBusSessionReceiver] (parallel-2) {"az.sdk.message":"Did not a receive message within timeout.","sessionId":"PMS","entityPath":"TOPIC_NAME/subscriptions/SUB_1","linkName":"session-_939b07_1700132453288","timeout":"PT1M"}
2023-11-16 11:01:57,461 INFO  [com.azure.messaging.servicebus.ServiceBusSessionReceiver] (parallel-2) {"az.sdk.message":"Did not a receive message within timeout.","sessionId":"PMS","entityPath":"TOPIC_NAME/subscriptions/SUB_2","linkName":"session-_400e82_1700132454099","timeout":"PT1M"}

It looks like sessionIdleTimeout needs to reflect the maximum message processing time. We have cases where a single topic carries messages that require different processing times (from milliseconds to minutes). In this case we need to set it to the maximum expected time, which results in sessions being idle most of the time (SUB_2 case). I would expect that:

  • sessionIdleTimeout should be measured relative to the last operation within the session, e.g. a message complete/abandon operation
  • maxAutoLockRenewDuration should be used to renew the session lock, but it looks like it's ignored at the moment

Am I missing something? What would be the recommended setup for the provided scenario? We want to control how long we wait for a message within a session, but at the moment that wait is tied to the message processing time.
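The timestamps in the logs above can be checked with a few lines of arithmetic. The sketch below is only an illustration of the two possible timer semantics; the class and method names are hypothetical, and the assumption that the timer started when the message was delivered is inferred from the log timestamps, not taken from SDK source.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper for illustration; not part of the Service Bus SDK.
final class IdleTimeoutTimeline {

    // Expiry time of an idle timer that starts at timerStart.
    static LocalTime expiry(LocalTime timerStart, Duration idleTimeout) {
        return timerStart.plus(idleTimeout);
    }

    public static void main(String[] args) {
        Duration idleTimeout = Duration.ofMinutes(1);              // "timeout":"PT1M" in the logs
        LocalTime delivered = LocalTime.parse("11:00:57.445");     // START on both subscriptions
        LocalTime sub2Finished = LocalTime.parse("11:01:40.965");  // FINISH on SUB_2

        // Semantics that match the logs: timer runs from delivery.
        LocalTime fromDelivery = expiry(delivered, idleTimeout);
        // Semantics the question asks for: timer runs from the end of processing.
        LocalTime fromCompletion = expiry(sub2Finished, idleTimeout);

        System.out.println("expiry if timed from delivery:   " + fromDelivery);
        System.out.println("expiry if timed from completion: " + fromCompletion);
        System.out.println("real idle wait left for SUB_2:   "
                + Duration.between(sub2Finished, fromDelivery));
    }
}
```

Timed from delivery, the expiry lands at 11:01:57.445, which lines up with the "Did not a receive message within timeout." entries logged at 11:01:57,446 and 11:01:57,461. Timed from completion, SUB_2 would instead have stayed open until 11:02:40.965.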

Why is this not a Bug or a feature Request?

Setup (please complete the following information if applicable):

  • OS: MacOs/Windows (not related to raised issue)
  • IDE: IntelliJ (not related to raised issue)
  • Library/Libraries: [e.g. com.azure:azure-core:1.16.0 (groupId:artifactId:version)]
            <dependency>
                <groupId>com.azure</groupId>
                <artifactId>azure-messaging-servicebus</artifactId>
                <version>7.15.0-beta.4</version>
            </dependency>
            <dependency>
                <groupId>com.azure</groupId>
                <artifactId>azure-core-amqp</artifactId>
                <version>2.9.0-beta.6</version>
            </dependency>

Information Checklist: Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • [x] Query Added
  • [x] Setup information Added

lidkowiak avatar Nov 16 '23 13:11 lidkowiak

Hi @lidkowiak thanks for reaching out via this github issue. @anuchandy could you please follow up?

/cc @Azure/azsdk-sb-java

joshfree avatar Nov 27 '23 15:11 joshfree

Hello @lidkowiak, in all versions of the SDK (including 7.15-*), the sessionIdleTimeout needs to be greater than the maximum processing time of a message. The Javadoc states:

After the processor delivers a message to the processMessage(Consumer) handler, if the processor is unable to receive the next message from the session because there is no next message in the session or processing the current message takes longer than the sessionIdleTimeout then the session will time out. To avoid inadvertently losing sessions, choose a sessionIdleTimeout greater than the processing time of a message.


What would be the recommended setup for provided scenario?

Your scenario can be achieved by using the Low-Level Reactive Receiver with some customization. I looked into your scenario (that you added in the linked git issue) and implemented this customized solution.

You can find the complete solution here: https://github.com/anuchandy/servicebus-session-timeout-rolling. The solution uses design concepts similar to those that ServiceBusProcessorClient is built on. It depends on 7.15.0-beta.5 and uses the v2 Low-Level Reactive Receiver (see using v2 in beta).

The class App in the solution shows the usage.

Can you take a look, adjust the parameters per your application requirement (note on the importance of coarse timeout), and see if it can be tested on QA/Pre-Prod?

Also curious, could you let me know the following -

  1. the "maximum number of sessions" that you’re trying to process concurrently in a VM?
  2. the configuration of the VM (core, memory) running this concurrent processing?
  3. in general how long (days, weeks, months) does it take for the current solution to hang after deployment, requiring the restart?
  4. What is the bus service tier?

anuchandy avatar Nov 29 '23 16:11 anuchandy

Thanks for the reply. I'm going to analyse provided sample.

Answers to your questions:

ad 1) Up to 58 concurrent sessions for 10+ topics. Each topic has its own load characteristic.
ad 2) We're facing the issue on Windows VMs: Standard_D4s_v3, Standard_D8s_v3 with JRE 8.
ad 3) It depends, but generally speaking the issue is guaranteed to show up within a week after deployment/restart.
ad 4) Service Bus SKU: Standard

lidkowiak avatar Nov 30 '23 12:11 lidkowiak

@lidkowiak, how is your production run so far with the pattern we discussed above?

anuchandy avatar Feb 13 '24 19:02 anuchandy

Hi @lidkowiak. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

github-actions[bot] avatar Feb 16 '24 17:02 github-actions[bot]

Well, the solution is pretty low level. Would it be possible to include the infrastructure bit in the SB SDK? I believe it's a pretty common case to support with SB sessions.

lidkowiak avatar Feb 16 '24 18:02 lidkowiak

Thanks for getting back. Right now, we don't have a plan to alter how sessionIdleTimeout works. At present, the sessionIdleTimeout has to match the maximum processing time if the processor is used directly. Alternatively, you can use the sample solution in the previous comment, which should give you the behavior that you wanted. We might rethink the sessionIdleTimeout behavior in the future, but for now it's in the backlog.

anuchandy avatar Feb 28 '24 17:02 anuchandy

Resolving. Clarified the query, shared sample code for the use case.

Please also look at this documentation https://learn.microsoft.com/en-us/azure/developer/java/sdk/troubleshooting-messaging-service-bus-overview#upgrade-to-715x on 7.15.x upgrade (7.15.1 as of today) and the "new session opt-in".

The sample code already includes "session opt-in", so there is no need to modify any code, only change the dependency (beta to stable). If Processor is used directly, ensure that session is opted-in as the documentation explains.

anuchandy avatar Mar 04 '24 17:03 anuchandy

Hi @anuchandy 👋🏼 As we're facing the same challenge in our setup regarding sessionIdleTimeout in combination with varying (unpredictable) processing times of consumers on session-enabled SB queues (in our case seconds to 15 minutes or more), I would like to re-confirm a few points regarding your provided solution, before we go ahead with a PoC:

  • the solution will apply the sessionIdleTimeout individually to each (concurrent) session receiver, so that the configured timeout will be applied after the last message for the given session has been processed, essentially removing the need to configure the sessionIdleTimeout according to the longest (anticipated) overall processing time. Is this understanding correct?
  • in the ServiceBusSessionProcessor from your linked example you implemented a load balancing of concurrency across the pumps from the connections: is this an additional optimization or needed for the desired behavior?
  • are you expecting any issues when trying this with the upstream 7.15.x dependency rather than the (older) beta version of the SDK?

Thanks!

setema avatar Apr 12 '24 16:04 setema

Hello @setema, please see below,

  1. You’re right, the timer starts only after the application returns control from the onMessage handler, and the timer is cancelled if the next message arrives before it expires. If it expires while waiting for the next message arrival, then the session will be closed and an attempt to obtain another session is made.

  2. If there are many sessions, then it’s a good idea to load balance.

    • The solution offers "in-proc load balancing" to limit the number of sessions opened in one TCP connection. As you pointed out, in the solution, the number of sessions in one TCP connection is limited to 30, so if 75 sessions are expected then there will be 3 TCP connections. The reason for balancing is that each TCP connection and all internal activities (receive, auth-token refresh, management operations, disposition) of the sessions it hosts must be multiplexed on a single IO-Thread. Often this one IO-Thread per connection becomes a performance bottleneck and causes lags in the critical path (e.g. impacting timely recovery on transient errors).
    • If there are many sessions, another (recommended) option is "out-proc load balancing", where we run the application on multiple independent nodes with a small number of sessions per node. This also has the advantage of one node picking up sessions while another is self-healing.

    Also, Lukasz (who asked the original question) has two machines (8, 4 cores) with 50 sessions, i.e., a limited number of nodes hence in-proc was reasonable.

  3. Yes, please use the latest 7.15.x GA version; the solution used a beta because the GA release was not available at that time. Also remember to keep the v2 opt-in.
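The rolling timer described in point 1 can be sketched with plain JDK classes. This is a minimal illustration of the idea, not the code from the linked repository and not an SDK API; the class name RollingIdleTimer, its methods, and the onSessionIdle callback are all hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the Service Bus SDK
// and not taken from the linked sample repository.
final class RollingIdleTimer implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Duration idleTimeout;
    private final Runnable onSessionIdle;
    private ScheduledFuture<?> pending; // guarded by "this"

    RollingIdleTimer(Duration idleTimeout, Runnable onSessionIdle) {
        this.idleTimeout = idleTimeout;
        this.onSessionIdle = onSessionIdle;
    }

    // Call when the next message arrives, before invoking the handler.
    // No timer runs while the handler is busy, so slow processing
    // cannot expire the session.
    synchronized void onMessageArrived() {
        cancelPending();
    }

    // Call when the handler returns control: the idle period starts now.
    synchronized void onHandlerReturned() {
        cancelPending();
        pending = scheduler.schedule(
                onSessionIdle, idleTimeout.toMillis(), TimeUnit.MILLISECONDS);
    }

    private void cancelPending() {
        if (pending != null) {
            // cancel(false) does not interrupt a callback that already started;
            // a production version would have to handle that race explicitly.
            pending.cancel(false);
            pending = null;
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

In a receive loop, one such timer per session would have onMessageArrived() called before the handler and onHandlerReturned() called after it, with onSessionIdle closing the session and requesting another one. How the linked solution wires this into the reactive receiver may differ.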

anuchandy avatar Apr 12 '24 23:04 anuchandy

Thanks for the quick clarification @anuchandy we will try out the custom solution.

setema avatar Apr 15 '24 08:04 setema