azure-sdk-for-net
[BUG] transient Service Bus Session Processor ServiceCommunicationProblem errors cause doubling of memory use
Library name and version
Azure.Messaging.ServiceBus 7.18.2
Describe the bug
We are seeing an issue where transient connection errors to Service Bus (ServiceCommunicationProblem) result in huge memory spikes (50%+) on our session receiver. This is a problem because we're uncertain how much we need to over-provision our k8s pods so they aren't restarted when reaching limits.
Expected behavior
We need to understand resource requirements, so we know how much to over-provision.
Actual behavior
We currently run a k8s cluster with a few hundred devices/session processors per instance. This works perfectly well until the transient ServiceCommunicationProblem errors occur every few days, which more than doubles memory usage, exceeding the quota and causing k8s to restart the instance. The errors usually hit only one or two instances at a time, not all of them.
Background
We are using code similar to the repro below in the gateway side of a system handling connections to thousands of IoT devices (I previously outlined our service side here: https://github.com/Azure/azure-sdk-for-net/issues/42022#issuecomment-1951533511). Whenever a device connects to our gateway, we need a session processor with a session lock for that specific deviceId (to receive messages from our service to the device). We use an infinite renewal because we never want the lock to expire - we only ever want to lose the lock when the device disconnects and we call processor.CloseAsync(). We want all the library's in-built backoff/retry mechanisms to keep the session lock, and to re-obtain it when necessary. We never want it to give up, because the device itself is still connected to us.
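For illustration, the disconnect side looks roughly like the sketch below (the dictionary and method name are our own placeholders, not Service Bus APIs; the processor setup itself is shown under Reproduction Steps):

    using System.Collections.Concurrent;
    using Azure.Messaging.ServiceBus;

    // One processor per connected device, keyed by deviceId.
    private readonly ConcurrentDictionary<string, ServiceBusSessionProcessor> _processors = new();

    private async Task OnDeviceDisconnectedAsync(string deviceId)
    {
        // The only point at which we ever want to give up the session lock.
        if (_processors.TryRemove(deviceId, out ServiceBusSessionProcessor processor))
        {
            await processor.CloseAsync(); // stops processing and closes the session receiver
        }
    }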
Questions
- Are these transient SB errors expected? They seem to cause error logs for up to 10 minutes, does that mean SB was inaccessible to that instance for that period, or is that a side effect? If it's an issue with SB itself, why aren't all our instances affected at once? Would upgrading from Standard to Premium tier eliminate these issues?
- Are transient SB errors expected to cause a doubling of memory usage? Should the usage go down once connectivity is restored?
- I wonder if the library should somehow detect that all several hundred processors are in error for the same transient SB reason affecting the single SB queue, and better 'pool' the retries without doubling the memory footprint?
- We also see ErrorSource: RenewLock ex: Azure.Messaging.ServiceBus.ServiceBusException: The session lock was lost. Request a new session receiver. (SessionLockLost) in ProcessErrorAsync. Is "Request a new session receiver" an instruction, i.e. should I dispose of this session and request a new one? Or should it keep retrying to get it back? (Again, we only want the lock to be released when the device disconnects, and we want to use the in-built retry.)
I've read troubleshooting.md and will continue experimenting with TryTimeout; however, it's not clear whether that guidance is relevant given our MaxConcurrentSessions = 1, and that everything works fine outside of these transient SB connection errors. Our message processing time is also very short (all we do is forward the SB message to the device, a few milliseconds).
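For context, the TryTimeout experimentation is being done through the client-level retry options, roughly as in this sketch (connectionString is a placeholder; the values shown are just the library defaults, not settled production settings):

    using Azure.Messaging.ServiceBus;

    var clientOptions = new ServiceBusClientOptions
    {
        RetryOptions = new ServiceBusRetryOptions
        {
            Mode = ServiceBusRetryMode.Exponential,
            TryTimeout = TimeSpan.FromSeconds(60), // the value we're experimenting with
            MaxRetries = 3,
            Delay = TimeSpan.FromSeconds(0.8),
            MaxDelay = TimeSpan.FromSeconds(60)
        }
    };

    var client = new ServiceBusClient(connectionString, clientOptions);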
Reproduction Steps
Session receiver setup:
var options = new ServiceBusSessionProcessorOptions
{
    AutoCompleteMessages = true,
    MaxConcurrentSessions = 1,
    MaxConcurrentCallsPerSession = 1,
    MaxAutoLockRenewalDuration = Timeout.InfiniteTimeSpan,
    SessionIdleTimeout = TimeSpan.FromMinutes(5),
    SessionIds = { deviceId },
    Identifier = connectionId
};

ServiceBusSessionProcessor processor = client.CreateSessionProcessor("GatewayRx", options);
processor.ProcessMessageAsync += SessionMessageHandler;
processor.ProcessErrorAsync += SessionErrorHandler;
await processor.StartProcessingAsync();
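The handlers wired up above are minimal; roughly as follows (ForwardToDeviceAsync and _logger stand in for our gateway code):

    private async Task SessionMessageHandler(ProcessSessionMessageEventArgs args)
    {
        // Forward the message body to the device's websocket; this takes a few milliseconds.
        await ForwardToDeviceAsync(args.SessionId, args.Message.Body);
        // AutoCompleteMessages = true, so no explicit CompleteMessageAsync is needed here.
    }

    private Task SessionErrorHandler(ProcessErrorEventArgs args)
    {
        _logger.LogError(args.Exception,
            "Service Bus session error. Source: {Source} Entity: {Entity}",
            args.ErrorSource, args.EntityPath);
        return Task.CompletedTask;
    }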
After the receiver acquires a session lock (for deviceId) and starts receiving messages, temporarily block access to the Service Bus namespace from your IP via Azure Portal > Service Bus resource > Settings > Networking. You should then see a significant spike in the application's memory usage, which does not come down once access is restored (even after several hours).
This seems to be a reasonable way to trigger the issue, but I'm not sure how equivalent it is to what occurs naturally every few hours to days.
Environment
.NET SDK: 8.0.205 (Commit: 3e1383b780, Workload version: 8.0.200-manifests.818b3449)
Repro'd using Runtime Environment: OS Name: Windows, OS Version: 10.0.19045, OS Platform: Windows, RID: win-x64, Base Path: C:\Program Files\dotnet\sdk\8.0.205\
Prod k8s is Ubuntu.
Hello @bramley-jetcharge. I'm an AI assistant for the azure-sdk-for-net repository. I have some suggestions that you can try out while the team gets back to you.
• Transient ServiceCommunicationProblem errors can occur as part of normal retry behavior, but a sustained doubling of memory isn’t expected.
• Verify that the built‑in retry/backoff (review the TROUBLESHOOTING documentation here) isn’t compounding resource usage during prolonged connection issues.
• Consider whether an infinite auto‑lock renewal (MaxAutoLockRenewalDuration = Timeout.InfiniteTimeSpan) might be contributing to unreleased buffers during network interruptions. Experiment with a finite duration to see if memory recovery improves once connectivity is restored.
• The error message regarding a lost session lock ("Request a new session receiver") suggests that when a lock is lost, the context should be re‐initialized. Confirm that your error handler is properly disposing of the affected session receiver and relying on the built‑in retry, or explicitly re-creating the processor if needed.
• If multiple processors experience these transient issues unevenly, it may be worth exploring a pooling mechanism or shared connections to reduce per-receiver overhead.
• Upgrading to Premium Service Bus may help isolate and mitigate transient communication errors, as Premium provides predictable performance and a dedicated resource model.
• Finally, consider increasing logging/telemetry around the error scenarios to determine if leaked resources or unusually long retry periods correlate with the memory spikes.
The team will get back to you shortly, hopefully this helps in the meantime.
Hi @bramley-jetcharge. Thanks for reaching out and we regret that you're experiencing difficulties. ServiceCommunicationProblem is a generic reason code for when there's no service-specific meaning and usually indicates a general network failure. The inner exception(s) and stack traces would likely offer better context.
The overall scenario of having a large number of instances running and a network failure that starts to occur more frequently as things are running very much aligns with behavior that we see when the host machine does not have adequate resources for the degree of concurrency and/or network use taking place. This may be due to SNAT port failures or contention/queuing of async continuations in the thread pool causing stutters or timeouts. Here, again, the inner exception(s) and stack traces may help. Grabbing memory dumps and analyzing the thread pool, async state, and network state would likely be the best path for troubleshooting.
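As a lighter-weight first step before full dumps, something like the following sketch can surface thread pool queuing from inside the application (the counters are standard .NET APIs; the periodic console logging is purely illustrative):

    using System;
    using System.Threading;

    // Log thread pool state periodically; a steadily growing PendingWorkItemCount during
    // the error windows points to continuations queuing faster than they can run.
    var timer = new Timer(_ =>
    {
        ThreadPool.GetAvailableThreads(out int workerThreads, out int ioThreads);
        Console.WriteLine(
            $"ThreadCount={ThreadPool.ThreadCount} " +
            $"PendingWorkItems={ThreadPool.PendingWorkItemCount} " +
            $"AvailableWorkers={workerThreads} AvailableIo={ioThreads}");
    }, null, TimeSpan.Zero, TimeSpan.FromSeconds(15));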
To try to answer your specific questions:
- Are these transient SB errors expected? They seem to cause error logs for up to 10 minutes, does that mean SB was inaccessible to that instance for that period, or is that a side effect? If it's an issue with SB itself, why aren't all our instances affected at once? Would upgrading from Standard to Premium tier eliminate these issues?
No, this is not expected. Most often we see these related to the application host's resources. As such, I wouldn't expect that upgrading the Service Bus SKU would help, but it cannot be ruled out entirely without a deeper understanding of the root cause. For example, reducing latency and improving throughput may help clear some contention and may keep things moving - or it may not.
- Are transient SB errors expected to cause a doubling of memory usage? Should the usage go down once connectivity is restored?
Generally, memory spikes like this occur when applications are processing large messages or are holding a large number of small messages in a batch/prefetch. When a network failure takes place, the messages held in the AMQP transport queue from in-flight transfers and dispatches take up space until disposal takes place. In your scenario, this could also be due to failures happening across a number of your process instances concurrently (as with SNAT port exhaustion), or to messages held in memory waiting for their async continuations to run (as with thread pool queuing).
- I wonder if the library should somehow detect that all several hundred processors are in error for the same transient SB reason affecting the single SB queue, and better 'pool' the retries without doubling the memory footprint?
The library intentionally treats instances individually: they do not know of one another, nor do they coordinate. Likewise, each error is observed by the processor as a single point-in-time occurrence; it has no awareness of error history or patterns. The processor does not understand enough about the host environment or application to automatically apply a heuristic across error patterns and alter that behavior. The processor is likely to recover from any specific instance of the error, but in aggregate these errors may indicate a wider problem. As a result, if you're seeing frequent exceptions in your handler or consistent failure patterns, the application will need to diagnose and remediate.
- We also see ErrorSource: RenewLock ex: Azure.Messaging.ServiceBus.ServiceBusException: The session lock was lost. Request a new session receiver. (SessionLockLost) in ProcessErrorAsync. Is "Request a new session receiver" an instruction, ie I should dispose of this session and request a new one? Or should it keep retrying to get it back? (again we only want the lock to be released when the device disconnects, and we want to use the in-built retry).
For your scenario, this is another strong indicator of resource contention and the thread pool being unable to schedule async continuations quickly enough to keep processors in a healthy state. Most likely, the persistent background task responsible for renewing the lock is not able to run consistently on schedule or is encountering network delays/failures performing the renewal.
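To illustrate how that surfaces in your handler: you can distinguish this case by the failure reason, and for the processor it is generally safe to treat it as informational, since the processor will attempt to acquire the session again on its next receive cycle (a sketch only; _logger is a placeholder for your logging):

    private Task SessionErrorHandler(ProcessErrorEventArgs args)
    {
        if (args.Exception is ServiceBusException sbEx
            && sbEx.Reason == ServiceBusFailureReason.SessionLockLost)
        {
            // Informational for the processor: it re-acquires the session on the next
            // receive attempt, so no manual receiver re-creation is needed here.
            _logger.LogWarning(sbEx,
                "Session lock lost (source: {Source}); processor will re-acquire.",
                args.ErrorSource);
        }
        else
        {
            _logger.LogError(args.Exception,
                "Service Bus error. Source: {Source}", args.ErrorSource);
        }

        return Task.CompletedTask;
    }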
Hi @bramley-jetcharge. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
Thank you for the quick response; you've given us a lot of leads to track down!
For now, though, just to clarify: we can reproduce the memory jump locally using just a single processor and a single message doing zero work (the repro steps above), and the usage stays high even once comms are restored.
We'll experiment with a MaxAutoLockRenewalDuration other than Infinite, but that brings me back to Q4: how do we recognise when the library has finally 'given up' on retrying to renew a session lock? Originally I expected to only get this via SessionClosingAsync and to do my cleanup there, but the docs for that handler do not indicate it would be called in this situation.
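For reference, this is the kind of handler we expected to rely on for that cleanup (a sketch; OnSessionClosedAsync is our own placeholder):

    processor.SessionClosingAsync += async args =>
    {
        // We expected this to fire when the library finally gives up on the session,
        // so we could tear down our device state here.
        await OnSessionClosedAsync(args.SessionId);
    };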
Here's an extract of the traces from a prod incident. There were many thousands of errors within a few minutes.
TimeGenerated [UTC] LogMessage
18/03/2025, 21:47:17.359 fail: JETCharge.WebsocketGateway.Router[0]
18/03/2025, 21:47:17.361 chargepoint-websocket-gateway-service-1-556965bdb-abcd_device1234_1df3f4b3 Service Bus session error. Source: RenewLock ex: Azure.Messaging.ServiceBus.ServiceBusException: The session lock was lost. Request a new session receiver. (SessionLockLost). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:17.361 ---> Azure.Messaging.ServiceBus.ServiceBusException: The write operation failed, see inner exception. ErrorCode: Shutdown (ServiceCommunicationProblem). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:17.361 ---> System.IO.IOException: The write operation failed, see inner exception.
18/03/2025, 21:47:17.361 ---> System.Net.Sockets.SocketException (32): Broken pipe
18/03/2025, 21:47:17.361 at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
18/03/2025, 21:47:17.361 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.361 at System.Net.Security.SslStream.<WriteSingleChunk>g__CompleteWriteAsync|157_1[TIOAdapter](ValueTask writeTask, Byte[] bufferToReturn)
18/03/2025, 21:47:17.361 at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
18/03/2025, 21:47:17.361 --- End of inner exception stack trace ---
18/03/2025, 21:47:17.361 at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
18/03/2025, 21:47:17.362 at Microsoft.Azure.Amqp.Transport.TlsTransport.HandleOperationComplete(IAsyncResult result, Boolean write, Boolean syncComplete)
18/03/2025, 21:47:17.362 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.362 at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
18/03/2025, 21:47:17.362 at Microsoft.Azure.Amqp.AmqpObject.OnSafeCloseComplete(IAsyncResult result)
18/03/2025, 21:47:17.362 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.362 at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
18/03/2025, 21:47:17.362 at Microsoft.Azure.Amqp.AmqpObject.OnSafeCloseComplete(IAsyncResult result)
18/03/2025, 21:47:17.362 --- End of stack trace from previous location ---
... repeated thousands of times...
18/03/2025, 21:47:17.370 at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.ExecuteRequest(TimeSpan timeout, AmqpRequestMessage amqpRequestMessage)
18/03/2025, 21:47:17.370 at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.RenewSessionLockInternal(TimeSpan timeout)
18/03/2025, 21:47:17.370 at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.<>c.<<RenewSessionLockAsync>b__66_0>d.MoveNext()
18/03/2025, 21:47:17.370 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.370 at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
18/03/2025, 21:47:17.370 at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
18/03/2025, 21:47:17.370 at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.RenewSessionLockAsync(CancellationToken cancellationToken)
18/03/2025, 21:47:17.370 at Azure.Messaging.ServiceBus.ServiceBusSessionReceiver.RenewSessionLockAsync(CancellationToken cancellationToken)
18/03/2025, 21:47:17.370 at Azure.Messaging.ServiceBus.SessionReceiverManager.RenewSessionLock() msg: The session lock was lost. Request a new session receiver. (SessionLockLost). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:17.370 Azure.Messaging.ServiceBus.ServiceBusException: The session lock was lost. Request a new session receiver. (SessionLockLost). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:17.370 ---> Azure.Messaging.ServiceBus.ServiceBusException: The write operation failed, see inner exception. ErrorCode: Shutdown (ServiceCommunicationProblem). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:17.370 ---> System.IO.IOException: The write operation failed, see inner exception.
18/03/2025, 21:47:17.370 ---> System.Net.Sockets.SocketException (32): Broken pipe
18/03/2025, 21:47:17.370 at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
18/03/2025, 21:47:17.370 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.370 at System.Net.Security.SslStream.<WriteSingleChunk>g__CompleteWriteAsync|157_1[TIOAdapter](ValueTask writeTask, Byte[] bufferToReturn)
18/03/2025, 21:47:17.370 at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
18/03/2025, 21:47:17.370 --- End of inner exception stack trace ---
18/03/2025, 21:47:17.370 at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
18/03/2025, 21:47:17.370 at Microsoft.Azure.Amqp.Transport.TlsTransport.HandleOperationComplete(IAsyncResult result, Boolean write, Boolean syncComplete)
18/03/2025, 21:47:17.370 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.370 at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
18/03/2025, 21:47:17.370 at Microsoft.Azure.Amqp.AmqpObject.OnSafeCloseComplete(IAsyncResult result)
18/03/2025, 21:47:17.370 --- End of stack trace from previous location ---
... repeated thousands of times...
18/03/2025, 21:47:17.761 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.761 at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
18/03/2025, 21:47:17.761 at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
18/03/2025, 21:47:17.761 at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.ReceiveMessagesAsync(Int32 maxMessages, Nullable`1 maxWaitTime, CancellationToken cancellationToken)
18/03/2025, 21:47:17.761 at Azure.Messaging.ServiceBus.ServiceBusReceiver.ReceiveMessagesAsync(Int32 maxMessages, Nullable`1 maxWaitTime, Boolean isProcessor, CancellationToken cancellationToken)
18/03/2025, 21:47:17.761 at Azure.Messaging.ServiceBus.SessionReceiverManager.ReceiveAndProcessMessagesAsync(CancellationToken processorCancellationToken) msg: The session lock was lost. Request a new session receiver. (SessionLockLost). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:17.761 Azure.Messaging.ServiceBus.ServiceBusException: The session lock was lost. Request a new session receiver. (SessionLockLost). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:17.761 ---> Azure.Messaging.ServiceBus.ServiceBusException: The write operation failed, see inner exception. ErrorCode: Shutdown (ServiceCommunicationProblem). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:17.761 ---> System.IO.IOException: The write operation failed, see inner exception.
18/03/2025, 21:47:17.761 ---> System.Net.Sockets.SocketException (32): Broken pipe
18/03/2025, 21:47:17.761 at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
18/03/2025, 21:47:17.761 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.761 at System.Net.Security.SslStream.<WriteSingleChunk>g__CompleteWriteAsync|157_1[TIOAdapter](ValueTask writeTask, Byte[] bufferToReturn)
18/03/2025, 21:47:17.761 at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
18/03/2025, 21:47:17.761 --- End of inner exception stack trace ---
18/03/2025, 21:47:17.761 at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
18/03/2025, 21:47:17.762 at Microsoft.Azure.Amqp.Transport.TlsTransport.HandleOperationComplete(IAsyncResult result, Boolean write, Boolean syncComplete)
18/03/2025, 21:47:17.762 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.762 at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
18/03/2025, 21:47:17.762 at Microsoft.Azure.Amqp.AmqpObject.OnSafeCloseComplete(IAsyncResult result)
... repeated thousands of times...
18/03/2025, 21:47:17.764 --- End of stack trace from previous location ---
18/03/2025, 21:47:17.764 at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
18/03/2025, 21:47:17.764 at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
18/03/2025, 21:47:17.764 at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.ReceiveMessagesAsync(Int32 maxMessages, Nullable`1 maxWaitTime, CancellationToken cancellationToken)
18/03/2025, 21:47:17.764 at Azure.Messaging.ServiceBus.ServiceBusReceiver.ReceiveMessagesAsync(Int32 maxMessages, Nullable`1 maxWaitTime, Boolean isProcessor, CancellationToken cancellationToken)
18/03/2025, 21:47:17.764 at Azure.Messaging.ServiceBus.SessionReceiverManager.ReceiveAndProcessMessagesAsync(CancellationToken processorCancellationToken)
18/03/2025, 21:47:18.160 fail: JETCharge.WebsocketGateway.Router[0]
18/03/2025, 21:47:18.160 chargepoint-websocket-gateway-service-1-556965bdb-abcd_device456_0f84854b Service Bus session error. Source: Receive ex: Azure.Messaging.ServiceBus.ServiceBusException: The session lock was lost. Request a new session receiver. (SessionLockLost). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:18.160 ---> Azure.Messaging.ServiceBus.ServiceBusException: The write operation failed, see inner exception. ErrorCode: Shutdown (ServiceCommunicationProblem). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
18/03/2025, 21:47:18.160 ---> System.IO.IOException: The write operation failed, see inner exception.
18/03/2025, 21:47:18.160 ---> System.Net.Sockets.SocketException (32): Broken pipe
18/03/2025, 21:47:18.160 at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
18/03/2025, 21:47:18.160 --- End of stack trace from previous location ---
18/03/2025, 21:47:18.160 at System.Net.Security.SslStream.<WriteSingleChunk>g__CompleteWriteAsync|157_1[TIOAdapter](ValueTask writeTask, Byte[] bufferToReturn)
18/03/2025, 21:47:18.160 at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
18/03/2025, 21:47:18.160 --- End of inner exception stack trace ---
18/03/2025, 21:47:18.160 at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
18/03/2025, 21:47:18.160 at Microsoft.Azure.Amqp.Transport.TlsTransport.HandleOperationComplete(IAsyncResult result, Boolean write, Boolean syncComplete)
18/03/2025, 21:47:18.160 --- End of stack trace from previous location ---
18/03/2025, 21:47:18.160 at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
18/03/2025, 21:47:18.160 at Microsoft.Azure.Amqp.AmqpObject.OnSafeCloseComplete(IAsyncResult result)
As you're probably already aware, the failure in the stack trace indicates a network-level problem outside of anything the client has insight into or influence over. Because this is observed by the AMQP transport, it is surfaced as an unknown AMQP exception, which the Service Bus library represents as ServiceBusException(ServiceBusFailureReason: ServiceCommunicationProblem) - or, more plainly "an unknown network exception happened talking to the service." Because this is a general error of unknown cause, it is marked as transient (as is every ServiceCommunicationProblem).
Lock renewal will continually retry for any non-terminal exception. The retry policy is applied with the requisite back-offs and delays, and after all retries are exhausted, it enters another renewal cycle in which a small additional delay is applied and then a new set of attempts/retries takes place. Until the max duration passes, the lock is lost, or a terminal exception surfaces, the processor will keep trying to renew.
In this scenario, the network is in a bad state, which means that the underlying AMQP connection and all links are broken. Because locks are associated with a specific AMQP link by the service, your session lock is now orphaned until it ultimately expires or the processor establishes a new link. The client, however, does not know this from lock renewal, because lock renewal happens over a different AMQP link by service contract. (It is possible for the management link used for renewal to fail while the session is still held.) The processor doesn't find out that the session is lost until an attempt is made to read the next batch of messages, at which point the client opens a new connection/link, which cleans up the state on the service and releases the orphaned lock.
Originally I expected to only get this via SessionClosingAsync and to do my cleanup there, but the docs for that handler do not indicate it would be called in this situation.
It won't. As mentioned, the client does not know the session is lost because lock renewal is separate from the session itself. There's a slight bit more context around this that I talked about here.
Thanks for the info. The retry behaviour you describe is exactly what we want! We've done some more experimentation and here's where we're at:
Root Cause
You say the exceptions are a network-level problem surfacing from the underlying AMQP connection, and the library can't see any further info. This is occurring every few hours to days across 3 k8s instances with ~300 devices each. Each device is connected via a websocket, and each one gets its own ServiceBusSessionProcessor. When the issue occurs, it impacts only one or two instances, not all three. The websocket connections also remain up (until the instance is restarted by k8s due to hitting resource limits), indicating it's not a wider network issue.
- is it typical to have these network-level problems so frequently?
- if not, do you know how we can investigate further? We looked at SNAT port exhaustion - we get 64000 ports per VM, surely enough for ~900 devices
- It is interesting that the issue you linked to, raised straight after me, is also about AMQP connections being reset. I wonder if the cause is related
Mitigation
If we can't fix the root network issue, we have to mitigate the memory impact. You mentioned the memory pressure could be caused by a lot of queued messages or delays processing them.
We can reproduce the memory jump locally using just a single processor and a single message doing zero work (the repro steps above), and the usage stays high even once comms are restored.
We tested other configs, e.g. MaxAutoLockRenewalDuration = 1, SessionIdleTimeout = 1, MaxRetries = 1, TryTimeout = 1, Delay = 1, but nothing seems to make much difference to the memory spike.
Our service typically receives at most 1 message per 30 seconds per device. Messages are no more than a few KB, and the processing 'work' is simply to send them out over the websocket to the device.
- Is this a bug?
Impact
When one k8s instance's memory usage doubles, it reaches the resource limit and is restarted. All of its device websocket connections get dropped, and those devices immediately try reconnecting. Our load balancer then directs them onto the remaining instances, increasing pressure on those, sometimes resulting in a cascading failure. Because the device websocket connections are long-lasting, they would stay on the over-burdened instance rather than being rebalanced once the impacted instance finishes restarting.
- How can we avoid this disaster without having to over provision every instance to be able to take the full 200% load (100% devices plus 100% AMQP-induced outage capacity)?
One very important thing to note is that you're running with an outdated version of the Service Bus library which depends on an older version of the AMQP transport. There were several critical fixes in the latest (v2.6.9) which were related to internal stalls. The first step for troubleshooting or capturing repro will be to upgrade to the latest version of Azure.Messaging.ServiceBus.
is it typical to have these network-level problems so frequently?
Not in my experience. This is not something that we see in our high volume, long-running stress tests, nor a scenario that developers report to us frequently.
Generally, when we do get reports of these kinds of issues, they tend to be in containerized environments and resolve to the host not having enough resources for the load that it is handling, either due to configuring constraints too tightly or having too many containers running at once. Just to be sure that I'm clear - this is completely anecdotal, based on the folks that I've talked to.
if not, do you know how we can investigate further? We looked at SNAT port exhaustion - we get 64000 ports per VM, surely enough for ~900 devices
At this point, the trouble seems to be related to the physical network or OS networking stack. I'd start by doing network analysis for routes to the host and other potential instabilities in the end-to-end system. But this is also outside of my area of expertise, so you're best taking me with a grain of salt and engaging someone who specializes in network infrastructure.
It is interesting that the issue you linked to, raised straight after me, is also about AMQP connections being reset. I wonder if the cause is related
I have no insight, unfortunately. Connection reset could be initiated by the local host, the Service Bus service, or any number of network intermediaries along the way. Intercepting proxies, for example, rarely play nicely with the Azure messaging services.
The only thing that I can offer here is confirmation that neither the Service Bus clients nor the AMQP transport would initiate a connection reset.
We tested other configs, e.g. MaxAutoLockRenewalDuration = 1, SessionIdleTimeout = 1, MaxRetries = 1, TryTimeout = 1, Delay = 1, but nothing seems to make much difference to the memory spike. Our service typically receives at most 1 message per 30 seconds per device. Messages are no more than a few KB, and the processing 'work' is simply to send them out over the websocket to the device.
Is this a bug?
Without insight into what the memory looks like at the time you're seeing a spike, I don't have enough context to do more than speculate. The processor itself is stateful, but that state mutates over time and doesn't trigger a ton of allocations when tracking. Generally, the most memory pressure when consuming messages comes from the AMQP transport, which has to track incoming messages and hold onto references for them until they're settled.
However, in this scenario, you've got auto-complete enabled and are not using prefetch, so transport messages would only live as long as it takes to execute your processing handler, after which disposition takes place and the transport messages are disposed.
To comment further, I'd need to see a memory dump using the latest Azure.Messaging.ServiceBus package.
How can we avoid this disaster without having to over provision every instance to be able to take the full 200% load (100% devices plus 100% AMQP-induced outage capacity)?
What you're describing here is related to network and cluster configuration rather than use of the Service Bus library. This is out of my area of expertise and not something that the maintainers of the Azure SDK can assist with.
That said, if it were me, I'd start by looking at introducing awareness in the load balancer of when node health was degraded and experiencing a transient failure so that a delay can be applied before assigning the work. I'd rather trade a bit of latency in processing for stability of the cluster. However, if your application is overly sensitive to latency of this kind, then the right answer may be to overprovision.
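As one possible shape for that, a health check that reports the node as degraded while Service Bus errors are spiking could feed your readiness probe or load balancer (a rough sketch using ASP.NET Core health checks; IServiceBusErrorTracker is a hypothetical service you'd implement around your ProcessErrorAsync handler):

    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Extensions.Diagnostics.HealthChecks;

    // Hypothetical: tracks the recent rate of transient Service Bus errors observed
    // by your ProcessErrorAsync handlers.
    public interface IServiceBusErrorTracker
    {
        double RecentErrorRate { get; }
    }

    public class ServiceBusConnectivityHealthCheck : IHealthCheck
    {
        private readonly IServiceBusErrorTracker _errorTracker;

        public ServiceBusConnectivityHealthCheck(IServiceBusErrorTracker errorTracker) =>
            _errorTracker = errorTracker;

        public Task<HealthCheckResult> CheckHealthAsync(
            HealthCheckContext context, CancellationToken cancellationToken = default)
        {
            // Report Degraded while errors are elevated so the load balancer can delay
            // assigning reconnecting devices to this node; the threshold is illustrative.
            return Task.FromResult(_errorTracker.RecentErrorRate > 0.1
                ? HealthCheckResult.Degraded("Elevated Service Bus error rate")
                : HealthCheckResult.Healthy());
        }
    }

    // Registration (e.g. in Program.cs):
    //   builder.Services.AddHealthChecks().AddCheck<ServiceBusConnectivityHealthCheck>("servicebus");
    //   app.MapHealthChecks("/healthz");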
Thanks for your detailed support!
We updated to the latest Service Bus library, but still see the issue when simulating network issues via IP blocking. Again, we can still reproduce it with just one processor on a local PC using the repro steps in the OP. We still feel there is a memory issue in response to a potential network issue, so I'll work on getting you a memory dump.
The ServiceCommunicationProblem error continues to occur every few days (for minutes at a time) across our various services. Based on your advice we have raised a request with Azure k8s support to see if a network/config/vm issue is the root cause.
We also had a chat with Azure Service Bus support, who had a look on their side and found no server-side errors at the time of the outages (as expected). We also looked at the 'user errors' metric, which clearly spikes at the incident time.
We think that, outside of incidents, this metric mostly represents SessionCannotBeLockedException, which we expect to occur normally when our devices switch instances due to routine websocket closures on their end.
Hi again,
We saw the latest library version, 7.19.0, and ran some further load tests. Unfortunately the issue still occurs, but we now see more detail in the ProcessErrorAsync exceptions, e.g.
Service Bus session error. Source: AcceptSession ex: System.UnauthorizedAccessException: Put token failed. status-code: 401, status-description: Ip has been prevented to connect to the endpoint.
This detail is new, and will hopefully reveal more info on the root cause once we put this update into prod. (The message itself, of course, appears because we use IP blocking to simulate the cause.)
We captured memory dumps for this load test. Memory usage is still doubling. In this case, the memory usage spiked even higher, likely affected by the capture process.
- Please let us know how we can securely send you these memory dumps (e.g. DM an email address and we can send a link)?
- We have also shared this with our AKS support request. That team was also able to review our AKS network configuration, and found no relevant issues there.
@bramley-jetcharge: That's an interesting observation. I'd be very interested in seeing the full stack trace for that; I suspect that's a service issue. It may also be helpful to capture a +/- 5-minute slice of Azure SDK logs around the behavior. For the dump, please drop a note to my GH handle at Microsoft dot com.
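For the log slice, the Azure SDK logs can be captured in-process with the Azure.Core event source listener; the console logger shown is just for illustration - route the output to your normal logging pipeline:

    using System.Diagnostics.Tracing;
    using Azure.Core.Diagnostics;

    // Captures Azure SDK event source output, including Azure.Messaging.ServiceBus,
    // for the window around the failure. Keep it at Verbose only while reproducing.
    using AzureEventSourceListener listener =
        AzureEventSourceListener.CreateConsoleLogger(EventLevel.Verbose);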
Hi @bramley-jetcharge, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
Clearing the flag for no recent activity. Dumps were shared and we're now waiting for support to grant access to the Azure SDK team for analysis.
Changing the designation; investigation and discussion taking place based on information already provided. Will update this issue with an outcome, once available.
To close out the issue for those interested: the investigation shifted to an email thread to allow for sharing of sensitive information and memory dumps. After dump analysis and several different approaches to reproduce, the analysis concluded that there is no evidence of a memory leak. There were some patterns in the application and environment that contributed to increased memory pressure, and guidance was provided to help mitigate them. There is also an expected spike in memory use after recovery from a prolonged network outage, but the usage is appropriate: it comes from handling a large backlog of pending messages, with all objects correctly rooted and reclaimed after garbage collection.