[azservicebus] Receiver indefinitely "stuck" after a long idle period

Open tkent opened this issue 2 years ago • 2 comments

Bug Report

After a long period of inactivity, a receiver will stop receiving new messages. A "long period" is somewhere between 8 hours and 13 days; the exact threshold is unknown.

(The problem was originally brought up in this comment)

This is very straightforward to demonstrate if you are willing to wait and set up a dedicated Service Bus instance. It occurs frequently with infrastructure used for QA, since that often receives no activity over weekends and holidays.

SDK Versions Used

I have seen this behavior across many versions of the azure-sdk-for-go, but the most recent test was conducted using these versions:

github.com/Azure/azure-sdk-for-go/sdk/azcore v1.1.0
github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.1.0
github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus v1.0.1

About the most recent time this was reproduced

We most recently reproduced this by running a small golang app in an AKS cluster using a managed identity assigned by aad-pod-identity.

In this test, we set up a dedicated Azure Service Bus + managed identity (terraform below) and let the app run. After 13 days, we came back to it. No errors had been emitted, just the regular startup message for the app. I then entered a message into the bus. The receiver in the app had not picked up the message after 30 minutes of waiting. We deleted the pod running the app and allowed it to be recreated by the deployment. The replacement pod immediately picked up the message.

Workaround

We can work around this issue by polling for messages with a 10-minute timeout and restarting the receive call in a loop. Our workaround looks like this and has run for weeks without an issue.

rcvrCtxForSdkWorkaround, canceller := context.WithTimeout(ctx, 10*time.Minute)
messages, err := azsbReceiver.ReceiveMessages(rcvrCtxForSdkWorkaround, 1, nil)
canceller()
if err != nil && !errors.Is(err, context.DeadlineExceeded) {
	r.logger.Info(EvtNameErrRetrievingMsgs, map[string]string{
		"error": err.Error(),
	})
	continue
}
// This just means that the context was closed before any messages
// were picked up. This could have been either the master context (ctx)
// or rcvrCtxForSdkWorkaround, so a loop is required.
if len(messages) == 0 {
	continue
}

The terraform for the test bus

The terraform below was used to set up the test bus and give the app identity access to it.

resource "azurerm_servicebus_namespace" "ns" {
  name                = "fixturens${local.deploy_token}"
  location            = local.location
  resource_group_name = data.azurerm_resource_group.rg.name
  sku                 = "Standard"
  tags                = local.standard_tags
}

resource "azurerm_servicebus_queue" "test" {
  name                                 = "test"
  namespace_id                         = azurerm_servicebus_namespace.ns.id
  dead_lettering_on_message_expiration = false
  enable_partitioning                  = false
  default_message_ttl                  = "PT48H"
}

resource "azurerm_role_assignment" "full_access_ra" {
  for_each             = local.authorized_authorized_principal_ids_as_map
  scope                = azurerm_servicebus_queue.test.id
  principal_id         = each.value
  role_definition_name = "Azure Service Bus Data Owner"
}

tkent avatar Jul 01 '22 05:07 tkent

Hi @tkent, thank you for filing this issue. I know it's frustrating to deal with a bug, so I appreciate you working with me on this.

We have tests for these kinds of scenarios, but since you're seeing a bug, they clearly don't cover something. I'll see what I'm missing there.

richardpark-msft avatar Jul 01 '22 17:07 richardpark-msft

@richardpark-msft - Hey, I appreciate you looking into it. Frustrating, yes, but it would be much more frustrating if we didn't have a workaround or if we had filed an issue that got dismissed or ignored.

Priority-wise, since we have a workaround, it's not high on our list. That said, I'd imagine others don't want to have to go through the learning process on this one.

tkent avatar Jul 01 '22 17:07 tkent

Hey @tkent, I added a client-side idle timer that does something similar to what you outlined above. Under the covers, it recycles the link if nothing is received for 5 minutes. It was released in azservicebus 1.1.2.

Closing this now as we've formally implemented something similar to your workaround :).

This should help combat a situation I've been worried about for a bit. If the server idles out our link or detaches it, and we miss that event, then our link will still look alive during these quiet times even though it's never going to work. We now close out the link and attempt to recreate it, which forces a reconciliation between the service and the client.

richardpark-msft avatar Nov 09 '22 21:11 richardpark-msft

Reopening as there's still work for this.

richardpark-msft avatar Nov 18 '22 01:11 richardpark-msft