azure-sdk-for-js icon indicating copy to clipboard operation
azure-sdk-for-js copied to clipboard

Service Bus v7.4.0(the latest version) hangs receiving messages

Open winterAndvelvet opened this issue 3 years ago • 22 comments

  • Package Name: azure/service-bus
  • Package Version: 7.4.0
  • Operating system:linux debian
  • [ ] nodejs
    • version: 14.15.0
  • [ ] browser
    • name/version:
  • [ ] typescript
    • version:
  • Is the bug related to documentation in
    • [ ] README.md
    • [ ] source code documentation
    • [ ] SDK API docs on https://docs.microsoft.com

Describe the bug receiveMessages hang

To Reproduce Steps to reproduce the behavior:

code like:
  this.receiver = this.sbClient.createReceiver(this.queueName, {
                receiveMode: 'peekLock',
                maxAutoLockRenewalDurationInMs: 0 //set this to 0 to disable auto lock renew! because it's not work as expected!
            });;
  while (true) {
            try {
                // NOTE: asking for 20 messages does not guarantee that we will return
                // all 20 at once so we must loop until we get all the messages we expected.
                /**
                * @throws Error if the underlying connection, client or receiver is closed.
                * @throws Error if current receiver is already in state of receiving messages.
                * @throws `ServiceBusError` if the service returns an error while receiving messages.
                 */
                const messages = await this.receiver.receiveMessages(messageCount, {
                    maxWaitTimeInMs: 2 * 1000,
                });
                if (!messages.length) {
                    logger.debug('queue is empty');
                    continue;
                }
                messages.map(msg=>onMessageHandler(msg));
            } catch (err) {
                await onErrorHandler(new QueueError(err));
            }
        }

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Additional context Add any other context about the problem here.

winterAndvelvet avatar Feb 09 '22 01:02 winterAndvelvet

Thanks for reporting @winterAndvelvet

While we look to replicate the issue, I noticed that you are calling your own message and error handlers. If you already have these handlers set up, can you share why you chose to use the receiveMessages() method in a loop instead of the subscribe() method that will take your handlers?

ramya-rao-a avatar Feb 09 '22 18:02 ramya-rao-a

For concurrency control eaily,we choosed the polling message style.In the first few hours,it worked well,but get message hang when queue incoming message increase to 4k.I don't konw if hang is related with message incoming count in queue.In addition,i tied to call receiveMessages with timeout.But when receiveMessages timeout ,it throw "The receiver is already receiving messages" at the next loop step.This also can proved receiveMessages is really hang. Code Like:

 async startGetMessageLoop(onMessageHandler, onErrorHandler, options){
        let {trafficController, getMsgTimer}= options;
        while (true) {
            try {
                if(global.isReceiveExit){
                    logger.warn('receive sigterm,break rec msg loop');
                    break;
                }
                await trafficController.waitTrafficBlockLeave();
                let messageCount = trafficController.nextTrafficWindow();
                logger.debug(`next traffic window:${messageCount}`);
                // NOTE: asking for 20 messages does not guarantee that we will return
                // all 20 at once so we must loop until we get all the messages we expected.
                /**
                * @throws Error if the underlying connection, client or receiver is closed.
                * @throws Error if current receiver is already in state of receiving messages.
                * @throws `ServiceBusError` if the service returns an error while receiving messages.
                 */
                //bug:https://github.com/Azure/azure-sdk-for-js/issues/20274
                const messages = await Promise.race([this.receiver.receiveMessages(messageCount, {
                    maxWaitTimeInMs: 2 * 1000,
                }),new Promise((_, reject) => {
                    setTimeout(() =>{
                         reject(new Error('receive timeout'))}, 5000);
                 })]);
                if (!messages.length) {
                    logger.debug('queue is empty');
                    continue;
                }
                messages.map(msg=>onMessageHandler(msg));
            } catch (err) {
                if(err.message === 'receive timeout'){
                    logger.error(`receive msg timeout,continue`);
                    continue;
                }
                await onErrorHandler(new QueueError(err));
            }
            await sleep(getMsgTimer);
        }
    } 

image

winterAndvelvet avatar Feb 10 '22 02:02 winterAndvelvet

By the way,when use:

await this.receiver.receiveMessages(messageCount, {
                    maxWaitTimeInMs: 2 * 1000,
                })

also get current receiver is already in state of receiving messages exception,the maxWaitTimeInMs not work?

winterAndvelvet avatar Feb 10 '22 02:02 winterAndvelvet

@winterAndvelvet thank you for the detailed information! I will look into this.

jeremymeng avatar Feb 10 '22 18:02 jeremymeng

@jeremymeng thank you

winterAndvelvet avatar Feb 11 '22 06:02 winterAndvelvet

@winterAndvelvet You are right that the "isReceivingMessages" flag is only cleared at the end of the receiveMessages() call. I would expect maxWaitTimeInMs to work but from the above code and logs I could not see what went wrong. Is it possible that you enable more logging and see if anything unusual/erroneous when the issue starts happening? Please refer to the troubleshooting section for more information on getting debug logs:

https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/servicebus/service-bus/README.md#troubleshooting

It is probably useful to start with this configuration:

export DEBUG=azure*,rhea*,-rhea:raw,-rhea:message,-azure:core-amqp:datatransformer

jeremymeng avatar Feb 23 '22 01:02 jeremymeng

@winterAndvelvet while we are figuring out the root cause of the issue, one workaround is to pass an abort signal to the receiveMessages() call then handle the error:

import { AbortController } from "@azure/abort-controller";

// ...
        const messages = await this.receiver.receiveMessages(messageCount, {
          maxWaitTimeInMs: 3 * 1000,
          abortSignal: AbortController.timeout(2 * 1000),
        });
// ...
      } catch (err) {
        if (err.name === "AbortError") {
          console.log("2 seconds passed, abort. will try receiving again");
          continue;
        }
        await onErrorHandler(new Error(err));
      }

jeremymeng avatar Feb 24 '22 19:02 jeremymeng

@winterAndvelvet have you got any chance to enable more detailed logging or try out the workaround using abort signal?

jeremymeng avatar Mar 09 '22 22:03 jeremymeng

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Mar 30 '22 20:03 ghost

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Apr 14 '22 20:04 ghost

Hello. I'm having this same issue. the receiver.receiveMessages freezes if it tries to receive more than 2000 messages at once, and the maxWaitTimeInMs is not respected. Therefore it hangs and requires a restart. It works fine when receiving less than 2000 messages.

it also freezes if I batch process 200 messages at a time, until 2000 are requested. Once 2000 are locked, it will not receive any more messages from the queue (the 2000 limit is approximately when the issue occurs, might be more or less in reality)

Receiving messages...
Received 250 messages
Not enough messages to process yet, waiting for 4446 total messages
Processing all messages took: 0.493ms
Received 250 messages
Not enough messages to process yet, waiting for 4446 total messages
Processing all messages took: 0.339ms
Received 250 messages
Not enough messages to process yet, waiting for 4446 total messages
Processing all messages took: 0.357ms
Received 250 messages
Not enough messages to process yet, waiting for 4446 total messages
Processing all messages took: 0.324ms
Received 250 messages
Not enough messages to process yet, waiting for 4446 total messages
Processing all messages took: 0.384ms
Received 250 messages
Not enough messages to process yet, waiting for 4446 total messages
Processing all messages took: 0.319ms
Received 250 messages
Not enough messages to process yet, waiting for 4446 total messages
Processing all messages took: 0.336ms
Received 250 messages
Not enough messages to process yet, waiting for 4446 total messages
Processing all messages took: 0.342ms
  Error [AbortError]: The operation was aborted.
    at new AbortError
 Finally called

Here you can see that it receives the messages in a loop perfectly fine, until exactly 2000 have been received. At this point, the receiver.receiveMessages hangs and the abort controller is called.

Code:

import { AbortController } from "@azure/abort-controller";

let messageBatchSize = 250;
let messagesRequiredToStartProcessing = 4446;
let allMessages = [];

try {
            const messages = await receiver.receiveMessages(messageBatchSize, {
                maxWaitTimeInMs: 0.5 * 1000,
                abortSignal: AbortController.timeout(10 * 1000),
            });

            console.time(`Processing all messages took`);

            console.log(`Received ${messages.length} messages`);
            allMessages.push(...messages);

            if (allMessages.length < messagesRequiredToStartProcessing) {
                // Don't process until exactly `messageBatchSize` messages are received
                console.log(`Not enough messages to process yet, waiting for ${messagesRequiredToStartProcessing} total messages`);
                console.timeEnd(`Processing all messages took`);
                continue;
            }
    } catch (error) {
        console.log("Assigned Seating Queue Error: ", error)
    }    {
        console.log('Assigned Seating Queue: Finally called');
}

CodeJuggernaut avatar Apr 20 '22 17:04 CodeJuggernaut

@CodeJuggernaut Thank you very much for the logs and repro code. I will have another look this week. Is it possible for you to enable more details logging and share the logs when this issue happens by setting the following environment variable before running the test?

export DEBUG=azure*,rhea*,-rhea:raw,-rhea:message,-azure:core-amqp:datatransformer

jeremymeng avatar Apr 20 '22 17:04 jeremymeng

Hi, thanks for the fast reply. I enabled that debug logs and its printing out 10,000s plus lines of debug logs... which ones should I share?

Is there a secure way to share these logs with you?

CodeJuggernaut avatar Apr 20 '22 17:04 CodeJuggernaut

@CodeJuggernaut It would be helpful to see logs around when the test appears stuck then the call is aborted. Let me think about the secure way of sharing.

jeremymeng avatar Apr 20 '22 18:04 jeremymeng

@CodeJuggernaut it looks like there's some internal limitation of around 2000 messages that we can handle before they are processed (complete/abandan/defer/deadleter). I will try to find out more but as a workaround, can you lower the number to 2000, and ensure they are processed before receiving the next 2000?

let messagesRequiredToStartProcessing = 2000;

jeremymeng avatar Apr 22 '22 18:04 jeremymeng

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Apr 29 '22 20:04 ghost

@CodeJuggernaut just an update that this is actually a known issue https://github.com/Azure/azure-sdk-for-js/issues/11633. While we are working on a solution. I wonder whether the above workaround would work for you?

jeremymeng avatar May 13 '22 18:05 jeremymeng

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar May 24 '22 20:05 ghost

Hello @jeremymeng

the workaround that pass an abort signal to the receiveMessages() call not work. Afterthe receiveMessages() call abort, the receiveMessages() call hang again soon, as shown in the following figure. I feel I need to create a new receiver.

image

zfyseu avatar Oct 14 '22 10:10 zfyseu

@zfyseu is it possible for you to try the other mentioned workaround of processing at most 2000 messages in a batch?

jeremymeng avatar Oct 14 '22 16:10 jeremymeng

@jeremymeng My receiver only receives one message at a time.

zfyseu avatar Oct 16 '22 08:10 zfyseu

@zfyseu ah sorry I thought you were having the batching issue. Is it possible for you to turn on more detailed logging and share the logs when the SDK stops receiving messages?

export DEBUG=azure*,rhea*,-rhea:raw,-rhea:message,-azure:core-amqp:datatransformer

jeremymeng avatar Oct 17 '22 19:10 jeremymeng

@jeremymeng OK, I'm a little busy these days. I'll collect the logs later.

zfyseu avatar Oct 20 '22 01:10 zfyseu

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Oct 27 '22 02:10 ghost

Hi @jeremymeng , The time of the receiver hang is around 2022-10-26T17:31:40 UTC time. I take a screenshot of some logs near the time.

debugLogSnipaste_001

debugLogSnipaste_002

debugLogSnipaste_003

debugLogSnipaste_004

debugLogSnipaste_005

debugLogSnipaste_006

debugLogSnipaste_007

debugLogSnipaste_008

zfyseu avatar Oct 27 '22 05:10 zfyseu

@zfyseu Thank you for sharing the log. Just to confirm, this is with the abortSignal workaround? Do you see my logs for "[connection-8]" afterwards?

jeremymeng avatar Oct 28 '22 19:10 jeremymeng

@jeremymeng Yes, this is with the abortSignal workaround. And the AbortSignal is 5s after maxWaitTimeInMs.

image

zfyseu avatar Oct 29 '22 02:10 zfyseu

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Nov 05 '22 14:11 ghost

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Nov 28 '22 20:11 ghost

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Dec 06 '22 02:12 ghost