node-amqp10
ASB: client just stops receiving (node-amqp10 version 3.1.3)
On our production machine the receiver has now spontaneously stopped working three times. New messages are delivered to the queue, but we never receive them from ASB (receiver.on('message', ...) is no longer called). No error, nothing.
The first time it happened was after about a week of uptime. After that it took four weeks before it stopped working, and yesterday it happened again after only a couple of hours.
It can't be anything in our own code, because the receiver looks like this (simplified):
receiver.on('message', function(message) {
  let link = this;
  log(`start receive ${message.properties.messageId}`);
  handleMessage(message);
  link.accept(message);
  log(`message ${message.properties.messageId} accepted`);
});
The log shows:
start receive xxxxx
message xxxxx accepted
And then when another message is delivered to the queue (even if it is only a minute later, so no timeout can be the cause), it just stays there and nothing happens.
The big problem is that there is no error, so I cannot take action (in code or manually). I just have to keep an eye on the ASB queues, and when receiving stops I have to restart the application, which solves the problem because the connection is set up again.
Anyone else had this kind of problem?
Of course I could turn on DEBUG logging, but the problem is not predictable (it can take four weeks to appear) and we process thousands of messages a day (4 topics for sending and 5 queues for receiving), so I expect that would degrade performance and the log files would become enormous.
I think the cause must lie on the ASB side (the connection just gets lost), because when the problem occurs we don't get the normal timeout after 10 minutes of inactivity. With that timeout the connection would be set up again and the problem would only be temporary (still not nice, but not as bad as it is now).
I'll do some more research on this, but at the moment I don't have an idea how to handle this problem.
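As a stopgap until the root cause is found, one option is an application-level inactivity watchdog that rebuilds the connection when the queues have been silent for longer than expected. A rough sketch (the threshold and rebuildClient are hypothetical, not part of node-amqp10):
// Hypothetical stopgap, not part of node-amqp10: track when the last message
// arrived and rebuild the connection if the queues have gone silent for too long.
const MAX_SILENCE_MS = 30 * 60 * 1000; // assumption: tune to your actual traffic
let lastMessageAt = Date.now();

// Call this from every receiver's 'message' handler.
function noteActivity() {
  lastMessageAt = Date.now();
}

setInterval(function() {
  if (Date.now() - lastMessageAt > MAX_SILENCE_MS) {
    console.warn('No messages for too long, rebuilding the AMQP client');
    rebuildClient(); // hypothetical: disconnect the old client and run the setup code again
  }
}, 60 * 1000);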
@gjr-qm we pushed a fix out in 3.2.1 that might help you: https://github.com/noodlefrenzy/node-amqp10/blob/master/CHANGELOG.md#bug-fixes-1. Specifically, we weren't making deep copies of the policy (in your case the SB policy), and so the outgoing window was being incremented and not reset on the creation of new clients.
This should technically only affect you if you are indeed using multiple clients in the same codebase. Are you doing that? Could you possibly share more of your code (in particular the bits around creating the client, etc.)?
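To illustrate the class of bug being described (a purely hypothetical sketch of shared mutable state, not node-amqp10's actual internals): when several clients share one policy object instead of deep copies, whatever one client mutates during setup leaks into the next.
// Hypothetical illustration: both "clients" see each other's mutations
// because the policy object is shared rather than deep-copied.
var sharedPolicy = { senderLink: { outgoingWindow: 100 } };

function fakeClient(policy) {
  policy.senderLink.outgoingWindow += 1; // setup mutates the policy in place
  return { policy: policy };
}

var clientA = fakeClient(sharedPolicy);
var clientB = fakeClient(sharedPolicy);
console.log(clientA.policy.senderLink.outgoingWindow); // 102 - incremented twice, never reset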
Additionally, it would help to attach listeners to the errorReceived events on the client and on the created links, to see if you are actually getting any errors.
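A minimal sketch of that (the client-level event name is an assumption here; some versions expose it via a constant on the client class, so verify against your release):
var AMQPClient = require('amqp10').Client;
var Policy = require('amqp10').Policy;

var client = new AMQPClient(Policy.ServiceBusQueue);

// Assumption: the client emits 'errorReceived' as suggested above.
client.on('errorReceived', function(err) {
  console.error('client error:', err);
});

client.connect(serviceBusUri) // placeholder URI, built as elsewhere in this thread
  .then(function() {
    return client.createReceiver('myqueue'); // placeholder queue name
  })
  .then(function(receiver) {
    // The link-level event is the same one used elsewhere in this thread.
    receiver.on('errorReceived', function(err) {
      console.error('receiver error:', err);
    });
  });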
@mbroadst thanks for your response. We actually use multiple clients in the application. We do have a listener for the errorReceived event.
Here is a more complete version of the code we use:
let _setupQueue = (name, sasKey, settings, onReceive) => {
  Modules.logger.info(`Module asb: Starting ${name} queue...`);
  let amqpQueueClient = new AMQPClient(Policy.Utils.RenewOnSettle(1, 1, Policy.ServiceBusQueue));

  // Init from settings
  let queueReceive = settings.queue;
  let sasNameReceive = settings.sasname;
  let uriReceive = protocol + '://' + encodeURIComponent(sasNameReceive) + ':' + encodeURIComponent(sasKey) + '@' + serviceBusHost;

  // Connect and set up receiver
  amqpQueueClient.connect(uriReceive)
    .then(function() {
      return amqpQueueClient.createReceiver(queueReceive);
    })
    .then(function(receiver) {
      receiver.on('errorReceived', function(rx_err) {
        if (rx_err.condition && (rx_err.condition == 'amqp:link:detach-forced')) {
          Modules.logger.debug(`Timeout (${queueReceive})`);
        }
        else {
          Modules.logger.error(`${queueReceive} ===> RX ERROR: `, rx_err);
        }
      });
      receiver.on('message', function(message) {
        let link = this;
        Modules.logger.info(`Start receiver onMessage: ${message.properties.messageId}, queue: ${queueReceive}`);
        try {
          if (onReceive(message)) {
            link.accept(message);
            Modules.logger.info(`Message accepted: ${message.properties.messageId}, queue: ${queueReceive}`);
          }
        }
        catch (error) {
          Modules.logger.error(`${queueReceive} error: `, error);
        }
      });
    })
    .error(function(e) {
      Modules.logger.warn('connection error: ', e);
    });
};
So for every queue we call the _setupQueue function with the relevant data/settings and a function that should be called when a message is received. The message is accepted if that function returns true. I think the rest of the code is self-explanatory.
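For context, a call into that helper might look roughly like this (the queue name, settings, and handler below are made up for illustration):
// Hypothetical caller: one listener per queue; the handler returns true to accept the message.
_setupQueue('orders', mySasKey, { queue: 'orders-in', sasname: 'ListenPolicy' }, function(message) {
  processOrder(message.body); // hypothetical business logic
  return true;                // message will be accepted by the receiver handler
});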
I will try the latest version of node-amqp10 (3.2.2) and hope the problem is solved in the changes you mentioned. I'll keep you posted!
@gjr-qm thanks, and I hope so too! The client was originally designed mostly with the intention of using a single client (we have tools like amqp10-link-cache to help manage tons of similar link connections as well), since AMQP 1.0's preferred behavior is to multiplex over a single socket.
Having said that we do support multiple clients. This issue slipped through the cracks mostly because I'm the one doing most of the testing and rarely use multiple clients, so my apologies and hopefully we can catch these before shipping in the future.
@mbroadst I was not aware of the amqp10-link-cache module. I will look into that too, because at the moment we use our own caching of sender links (for the topics). Caching receiver links (for the queues) is not something we need, because the listeners for each queue should always run during the lifetime of the application.
Updating to the latest version of node-amqp10 didn't solve the problem. We had two more occurrences of queues that just stopped. Could it have something to do with the throttling we use on the receiver links? We got the throttling implementation from the documentation. Maybe the credits are not added in the following code (it would make sense that a queue stops receiving the moment it runs out of credits):
{
  receiverLink: {
    credit: function(link, options) {
      // If the receiver link was just connected, set the initial link credit to the quantum.
      // Otherwise, give more credit for every message we've settled.
      var creditQuantum = (!!options && options.initial) ? link.policy.creditQuantum : link.settledMessagesSinceLastCredit;
      if (creditQuantum > 0 && link.linkCredit < threshold) {
        link.addCredits(creditQuantum);
      }
    }
  }
}
Maybe a situation that occurs earlier in the process messes up the options, settledMessagesSinceLastCredit, linkCredit or threshold?
What would be the best way to check this?
Hrm, sorry to hear that. There definitely was an error, but maybe we're working against a few issues here. Unfortunately the best way to check what's going on is going to be grabbing some debug logs and/or printing out important values at critical junctures (such as the credit function you posted above).
Can you easily reproduce this issue, or is it another long-running issue?
The issue is not reproducible unfortunately...
One time the service runs for four weeks without problems, sometimes it takes only a couple of days. And we never had it on our development or test environment. I think I will put some extra logging in, because I'm afraid the debug functionality would cause other issues since it is so verbose.
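For the credit function specifically, that extra logging could be as little as printing the values the decision depends on each time it runs. A sketch based on the snippet above (threshold comes from the enclosing scope, as in the original):
receiverLink: {
  credit: function(link, options) {
    var creditQuantum = (!!options && options.initial) ? link.policy.creditQuantum : link.settledMessagesSinceLastCredit;
    // Extra logging (the only addition): show why credit is or isn't granted.
    console.log('credit check:',
      'initial=', !!options && options.initial,
      'creditQuantum=', creditQuantum,
      'linkCredit=', link.linkCredit,
      'threshold=', threshold);
    if (creditQuantum > 0 && link.linkCredit < threshold) {
      link.addCredits(creditQuantum);
    }
  }
}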
I also have this problem in version 3.4.0. Has anyone looked into it?
@magnusjt the ticket has lagged a little bit due to lack of data and difficulty in reproduction. Do you happen to have debug logs for your failure?
I have some debug logs now, but it's hard to digest it all. It seems to start with a disconnect:
02:54:55.641 debug connection error: Error: read ECONNRESET
02:54:55.641 debug Error from socket: Error: read ECONNRESET
02:54:55.642 debug disconnected
02:54:55.643 debug Transitioning from OPENED to DISCONNECTED due to error
This is followed by a force detach of all links (subscriptions). The failing links differ in what state they were in at that point:
This one seems to attach back fine after a while:
force detach for mylink1/Subscriptions/mytopic_41c63f8d-771a-4f81-bc85-fc6ded0ac929. current state: attached
mylink1/Subscriptions/mytopic_41c63f8d-771a-4f81-bc85-fc6ded0ac929:3: Transitioning from ATTACHED to DETACHED due to forceDetach
This one does not:
force detach for mylink2/Subscriptions/mytopic_10da943e-44bf-4afe-b576-f33052e3c7c2. current state: attaching
The last one does not get to the "Transitioning from..." message at all. I'm still not sure why the link was in the "attaching" state from before.
I tried to dig into the code, but it seems you have already refactored some of the relevant code. Will the issue be fixed in the next version?
@magnusjt I've been trying to work through similar issues for the 3.5.1 release, but I've got very limited time atm. It would be incredibly helpful if you could test against master as it is today to see if your issues are resolved. Also, if you could somehow provide me with a full log of your case it would better help me model your problem (as I have no personal access to SB/EH). Thanks.
Ok, I'll try master :) If it doesn't work out I'll see if I can grab the debug logs.
Stuck again. Not sure if it's the same issue, but it seems to be stuck in a loop trying to connect:
15:03:21.340 debug connecting to: <my_url>
15:03:21.345 debug Connecting to <my_url> via TLS
15:03:21.346 debug stateChange: DISCONNECTED => START , reason: connect
15:03:21.520 debug Sending Header <my_header>
15:03:21.521 debug stateChange: START => IN_SASL , reason: connected
15:03:21.564 debug Rx: <my_rx>
15:03:21.565 debug Server SASL Version: 414d515003010000 vs 414d515003010000
15:03:21.645 debug Rx: <my_rx>
15:03:21.646 debug <= [object Object]
15:03:21.647 debug raw: <my_raw>
15:03:21.647 debug Sending <my_key>
15:03:21.648 debug => [object Object]
15:03:21.649 debug raw: <my_raw>
15:03:21.692 debug Rx: <my_rx>
15:03:21.693 debug raw: <my_raw>
15:03:21.693 debug <= [object Object]
15:03:21.694 debug Sending Header <my_header>
15:03:21.694 debug stateChange: IN_SASL => HDR_SENT , reason: success
15:03:21.737 debug Rx: <my_rx>
15:03:21.738 debug Server AMQP Version: 414d515000010000 vs 414d515000010000
15:03:21.738 debug stateChange: HDR_SENT => HDR_EXCH , reason: validVersion
15:03:21.739 debug raw: <my_raw>
15:03:21.739 debug => [object Object]
15:03:21.740 debug stateChange: HDR_EXCH => OPEN_SENT , reason: sendOpen
15:03:21.751 debug on(end)
15:03:21.780 debug Rx: <my_rx>
15:03:21.781 debug <= [object Object]
15:03:21.782 debug raw: <my_raw>
15:03:21.782 debug stateChange: OPEN_SENT => OPENED , reason: openReceived
15:03:21.783 debug Setting heartbeat check timeout to 30000
15:03:21.783 debug Connected with params { maxFrameSize= 65536 , channelMax= 4999 , idleTimeout= 120000 , remoteIdleTimeout= 240000 }
15:03:21.784 debug connected
15:03:21.785 debug session already exists, re-using
15:03:21.786 debug stateChange(undefined): UNMAPPED => BEGIN_SENT , reason: sendBegin
15:03:21.787 debug raw: <my_raw>
15:03:21.787 debug => [object Object]
15:03:21.827 debug Rx: <my_rx>
15:03:21.828 debug <= [object Object]
15:03:21.829 debug stateChange(undefined): BEGIN_SENT => MAPPED , reason: beginReceived
15:03:21.829 debug raw: <my_raw>
15:03:21.830 debug On BEGIN_RCVD, setting params to (1,5000,100,255)
15:03:21.831 debug mapped
15:04:21.743 debug Rx: <my_rx>
15:04:21.744 debug raw: <my_raw>
15:04:21.744 debug <= [object Object]
15:04:21.745 debug stateChange(0): END_RCVD => UNMAPPED , reason: sendEnd
15:04:21.745 debug stateChange(0): MAPPED => END_RCVD , reason: endReceived
15:04:21.746 debug => [object Object]
15:04:21.746 debug raw: <my_raw>
15:04:21.747 debug Session unmapped - force-detaching all links.
15:04:21.747 debug unmapped
15:04:21.784 debug Rx: <my_rx>
15:04:21.785 debug raw: <my_raw>
15:04:21.785 debug <= [object Object]
15:04:21.787 debug connection error: AmqpProtocolError: amqp:connection:forced:The connection was inactive for more than the allowed 60000 milliseconds and is closed by container 'LinkTracker'.
15:04:21.787 debug stateChange: OPENED => CLOSE_RCVD , reason: closeReceived
15:04:21.789 debug raw: <my_raw>
15:04:21.789 debug => [object Object]
15:04:21.790 debug disconnected
15:04:21.791 debug stateChange: CLOSE_RCVD => DISCONNECTED , reason: sendClose
15:04:22.792 debug connecting to: <my_url>
.. and so on
@gjr-qm it looks like the heartbeat timer isn't sending heartbeat frames. Can you check to see if this code is being run: https://github.com/noodlefrenzy/node-amqp10/blob/master/lib/connection.js#L474-L478 ?
It doesn't seem to be run. I can't see any empty frames after the issue started.
@magnusjt sorry I accidentally responded to @gjr-qm ... Anyway, can you confirm that the function within the timeout is in fact being called? My guess is that something on this line is causing the if statement to bail out without sending the heartbeat frame. It might be enough to just print out each of those conditions individually and see where the breakdown occurs.
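Concretely, printing those conditions inside the heartbeat timer callback in lib/connection.js could look something like this (a local debugging sketch, not a patch; the names match the check the timer performs):
// Debugging sketch: log each part of the heartbeat condition before the existing check runs.
console.log('heartbeat check:',
  'connected=', !!self.connected,
  'transport=', !!self._transport,
  'sinceLastOutgoing=', Date.now() - self._lastOutgoing,
  'maxTimeBeforeHeartbeat=', maxTimeBeforeHeartbeat);
if (self.connected && self._transport && (Date.now() - self._lastOutgoing) > maxTimeBeforeHeartbeat) {
  // ... existing code that sends the empty heartbeat frame ...
}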
There are 20-30 days between occurrences, so adding a few random debug statements won't help. There has to be something else going wrong here besides the timeout. There is a full minute where nothing happens:
15:03:21.829 debug stateChange(undefined): BEGIN_SENT => MAPPED , reason: beginReceived
15:03:21.829 debug raw: <my_raw>
15:03:21.830 debug On BEGIN_RCVD, setting params to (1,5000,100,255)
15:03:21.831 debug mapped
--> what's going on here?
15:04:21.743 debug Rx: <my_rx>
Usually I see topics (links) start attaching after this, but now nothing. Maybe the links have been removed from the session somehow? And what about this line?
15:03:21.785 debug session already exists, re-using
@magnusjt this seems to be a pretty clear case of the heartbeat not being sent to the server, but perhaps something else is causing the delay. What I can see in the logs is that:
- the session maps properly (reusing the existing session, here)
- it negotiates the heartbeat time correctly (30000 ms in this case)
- nothing happens after remapping, including sending the timeout frame - which then leads to a remote disconnect
The question of course being "why didn't it send a heartbeat frame if everything else was set up properly?" A few possibilities occur to me off the top of my head:
- some local state information is incorrect (hence my suggestion to check those variables)
- something has been leaking, or is now generally blocking the event loop (perhaps it's possible to check whether memory is growing unbounded on the instances running this code?)
Do you have logs I can look at that involve a few reconnects (from working to not working)? I'm not sure it will help all that much, but at this point it seems we need to look for patterns.
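Regarding the memory question above, one low-overhead check is to log process.memoryUsage() on an interval alongside the normal application logs, for example:
// Periodically record heap/RSS so unbounded growth shows up in the normal logs.
setInterval(function() {
  var mem = process.memoryUsage();
  console.log('memory:',
    'rss=' + Math.round(mem.rss / 1024 / 1024) + 'MB',
    'heapUsed=' + Math.round(mem.heapUsed / 1024 / 1024) + 'MB');
}, 5 * 60 * 1000); // every 5 minutes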
Hi again. I added a few more debug statements. What's weird is this:
We negotiate a timeout of 120 000 ms:
Connected with params { maxFrameSize= 65536 , channelMax= 4999 , idleTimeout= 120000 , remoteIdleTimeout= 240000 }
Then suddenly it complains about being inactive for 60 000 ms:
AmqpProtocolError: amqp:connection:forced:The connection was inactive for more than the allowed 60000 milliseconds and is closed by container 'LinkTracker'.
The heartbeat check in the code checks the "maxTimeBeforeHeartbeat" variable which is set to 120 000ms.
if (self.connected && self._transport && (Date.now() - self._lastOutgoing) > maxTimeBeforeHeartbeat) {
This means that the first heartbeat can only be sent after 120 000 ms, which is twice the value in the error message.
So the question is: why is there a timeout of 60 000 ms in the first place, when we negotiated 120 000? Perhaps this is something to do with Azure?
@mbroadst forgot to mention you in the previous post
@mbroadst Finally got another error with debug statements on:
!!self.connected -> true
!!self._transport -> true
(Date.now() - self._lastOutgoing) -> 30027
maxTimeBeforeHeartbeat -> 120000
Memory usage is normal.
Any updates on this? I am facing the exact same issue.
we're getting
{ Error: read ECONNRESET
    at _errnoException (util.js:1022:11)
    at TLSWrap.onread (net.js:615:25)
  code: 'ECONNRESET', errno: 'ECONNRESET', syscall: 'read' }
or
{ Error: read ETIMEDOUT
    at _errnoException (util.js:1022:11)
    at TLSWrap.onread (net.js:615:25)
  code: 'ETIMEDOUT', errno: 'ETIMEDOUT', syscall: 'read' }
with ASB but the error seems to be thrown 12 minutes (or more) after it stops receiving messages. We're using v3.5.3. Does this seem related?
@njLT Some fixes were made in version 3.6.0 of node-amqp10. Try using that. I still believe this looks like a service-side issue, since you were getting this error while using the HTTP-based azure-sb library as well as this library.
@njLT, we are running into the same issue with version 3.5.3. Did upgrading to 3.6.0 solve it for you?
Sounds like you were having a similar problem with Microsoft's REST-based azure-sb library, so just curious.
@johnrhampton we faced a bigger issue with 3.6.0: only one consumer works at a time, and even that stops working after a while. You might want to check out https://github.com/amqp/rhea-promise, which is what the folks at MS now recommend. We stopped using IoT Hub altogether.
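For anyone evaluating that route, a rough rhea-promise receiver sketch for a Service Bus queue might look like the following. The connection details and queue name are placeholders, and the exact option and event names should be checked against the rhea-promise version you install:
const { Connection, ReceiverEvents } = require("rhea-promise");

async function main() {
  // Placeholder connection details for a Service Bus namespace.
  const connection = new Connection({
    transport: "tls",
    host: "<namespace>.servicebus.windows.net",
    hostname: "<namespace>.servicebus.windows.net",
    port: 5671,
    username: "<sas-key-name>",
    password: "<sas-key>"
  });
  await connection.open();

  // Attach a receiver to the queue and log incoming messages and errors.
  const receiver = await connection.createReceiver({ source: { address: "<queue-name>" } });
  receiver.on(ReceiverEvents.message, (context) => {
    console.log("received:", context.message.body);
  });
  receiver.on(ReceiverEvents.receiverError, (context) => {
    console.error("receiver error:", context.receiver && context.receiver.error);
  });
}

main().catch(console.error);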