
Failures in `SenderLink._sendMessage` are not buffered as `pendingSends`

Open geoffreysamper opened this issue 8 years ago • 45 comments

Sometimes when my server is under load and the connection is lost, I receive this exception. When the connection is restored everything works fine:

there was an error during sending  TypeError: Cannot read property 'write' of null
    at Object.frames.writeFrame (C:\github\profiling-tracker-api\node_modules\amqp10\lib\frames.js:56:9)
    at Connection.sendFrame (C:\github\profiling-tracker-api\node_modules\amqp10\lib\connection.js:328:10)
    at SenderLink._sendMessage (C:\github\profiling-tracker-api\node_modules\amqp10\lib\sender_link.js:202:29)
    at sendMessage (C:\github\profiling-tracker-api\node_modules\amqp10\lib\sender_link.js:120:28)
    at C:\github\profiling-tracker-api\node_modules\amqp10\lib\sender_link.js:143:5
    at SenderLink.send (C:\github\profiling-tracker-api\node_modules\amqp10\lib\sender_link.js:113:10)
    at C:\github\profiling-tracker-api\lib\eventHub.js:78:27
    at runCallback (timers.js:637:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5)

geoffreysamper avatar Feb 16 '17 15:02 geoffreysamper

We are seeing something very similar but with different outcomes (we're using 3.4.1). Once the queues get messed up they detach and reattach. Sometimes we see traces like the OP's; other times it is the sendFrame() call in _sendMessage that fails because the connection is null.

More fundamentally, we end up in a state where the promises returned from senderLink.send() never settle. As a result, the calling service just hangs. This seems to happen when calling send() while the link is in state === 'attaching'.

We poked about in the amqp code and tried a fix. See this fork. I'm not claiming that it is good or that it works, but I wanted to get a sense of whether this is the right direction.

I wanted to create a repro for testing but that would be hard. While we seem to get a lot of detach/reattach cycles with Azure Service Bus, it also seems to take a while for the problem to show up.

jeffmcaffer avatar Feb 25 '17 01:02 jeffmcaffer

@jeffmcaffer The way I can simulate my problem is to load test my server, ensure the CPU is saturated, and then force the connection to close. repo

geoffreysamper avatar Feb 27 '17 10:02 geoffreysamper

@jeffmcaffer @geoffreysamper my hunch here is that when detaching/closing the connection (either manually, or forced) we do not clean up some local bookkeeping that would otherwise indicate that we cannot send data (SenderLink.prototype.canSend), and which would otherwise queue your messages. I'll try to put together a unit test exhibiting this around lunch time.
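
Roughly the shape of bookkeeping I mean (a paraphrase for discussion only, with made-up names; not the actual amqp10 source, which checks more than this):

    // Paraphrased sketch of the send gate being described (hypothetical names).
    function SenderLinkSketch() {
      this.state = 'detached';   // link lifecycle state
      this.linkCredit = 0;       // credit granted by the peer
      this._pendingSends = [];   // sends deferred until the link can actually send
    }

    SenderLinkSketch.prototype.canSend = function () {
      // If stale state survives a dropped connection (e.g. still 'attached' with
      // leftover credit), this returns true and the caller ends up writing to a
      // connection that no longer exists.
      if (this.state !== 'attached') return false;
      return this.linkCredit >= 1;
    };

If that stale-state theory is right, the fix is to reset this bookkeeping on disconnect rather than to guard the write itself.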

mbroadst avatar Feb 27 '17 14:02 mbroadst

Thanks @mbroadst. Really appreciate your help here. We are standing up a production service and hitting this a lot.

I noticed a few things that might help figuring this out.

  • typically I see a mess of errors like the one that started this issue, THEN I see the link detach and attach
  • We switched from multiple AMQPClient objects (one per queue) to just one client that is reused across 5 different queues. That did not seem to change the behavior.
  • There may be two (un- or semi-) related issues here: the exceptions noted above, and a failure to settle promises. The promise issue is that sometimes we successfully get back a promise from the send() call and the promise is never settled. It seems to happen when there are errors, but not for all errors.
  • The changes in my fork, while not necessarily correct, do point to an issue with _sendMessage failing. That is, if _sendMessage throws, it is unclear how the returned promise gets settled. In the !canSend() case we "queue up" the sendMessage() call. That call can and does throw out at line 127 (if _sendMessage throws) when called at line 248.
  • I wonder if this is an issue in other places, for example, sending accept and release
  • I've been running with a larger credit limit on the sender link and a rate limiter on calls to send(), and it is better. Still see errors but fewer lost promises. FWIW, my credit is 100 and the rate limiter is set to 100 per second per process; a rough sketch of that limiter follows this list.
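
For reference, the limiter is roughly this shape (an illustrative sketch with made-up names, not our actual code):

    // Illustrative only: allow at most `limit` send() calls per second per process,
    // queueing the rest and draining the local queue once a second.
    function createLimitedSend(senderLink, limit) {
      const queue = [];
      let sentThisSecond = 0;

      setInterval(() => { sentThisSecond = 0; drain(); }, 1000);

      function drain() {
        while (queue.length && sentThisSecond < limit) {
          const item = queue.shift();
          sentThisSecond++;
          senderLink.send(item.message).then(item.resolve, item.reject);
        }
      }

      return function limitedSend(message) {
        return new Promise((resolve, reject) => {
          queue.push({ message, resolve, reject });
          drain();
        });
      };
    }

Bursts above the limit just wait in this local queue rather than all hitting send() at once.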

I've started seeing the forceDetach issue as well. See the trace below. There is no pattern I can see, and these occur on the receiver link too.

AmqpProtocolError:
   at SenderLink.forceDetach (at SenderLink.forceDetach (D:\home\site\wwwroot\node_modules\amqp10\lib\sender_link.js:283:30): D:\home\site\wwwroot\node_modules\amqp10\lib\sender_link.js: 283)
   at detachHandler (at detachHandler (D:\home\site\wwwroot\node_modules\amqp10\lib\session.js:441:41): D:\home\site\wwwroot\node_modules\amqp10\lib\session.js: 441)
   at Session._unmap (at Session._unmap (D:\home\site\wwwroot\node_modules\amqp10\lib\session.js:442:33): D:\home\site\wwwroot\node_modules\amqp10\lib\session.js: 442)
   at Session._processEndFrame (at Session._processEndFrame (D:\home\site\wwwroot\node_modules\amqp10\lib\session.js:319:8): D:\home\site\wwwroot\node_modules\amqp10\lib\session.js: 319)
   at Session._processFrame (at Session._processFrame (D:\home\site\wwwroot\node_modules\amqp10\lib\session.js:300:53): D:\home\site\wwwroot\node_modules\amqp10\lib\session.js: 300)
   at Connection._processFrameEH (at Connection._processFrameEH (D:\home\site\wwwroot\node_modules\amqp10\lib\session.js:186:49): D:\home\site\wwwroot\node_modules\amqp10\lib\session.js: 186)

jeffmcaffer avatar Feb 27 '17 17:02 jeffmcaffer

@jeffmcaffer my guess is something like this is going on:

  • the connection is broken because of
  • in this case we erroneously do not clean up some very important bookkeeping information
  • you attempt to send a message, and it checks if it can currently send (in order to determine whether to push to pendingSends or not):
    • this code is checked, and it erroneously returns true for canSend because it's using invalid information about a connection that is now long gone
  • now we go back to the call site of _sendMessage which indeed incorrectly attempts to write to the now non-existent connection, and all hell breaks loose

In other words: wrapping _sendMessage is not the right approach imho; it shouldn't throw when things are set up properly. The problem, instead, is that it's called with incomplete setup.

mbroadst avatar Feb 27 '17 17:02 mbroadst

Agreed that the exception root cause is elsewhere.

Try/catch may still be applicable because sendMessage can still throw (it does a network call, so unless that is guaranteed not to throw...) and there will be nothing to settle the promise. If we can't send (!canSend), then sendMessage() will be called later from _dispatchPendingSends(), which has no ability to settle the promise. If I have my promise coding right (I'm not a Bluebird person), sendMessage() throwing in the immediate path is handled by the Promise infrastructure (converted to a reject). However, in the !canSend path, sendMessage() is just a function that happened to be created during a call to new Promise; the invocation from _dispatchPendingSends() is not in the context of the promise infrastructure, so a throw there leaves the promise dangling.
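
To make that concrete, the kind of guard I have in mind is for the queued callback to settle its own promise when the underlying send throws, since it closes over resolve/reject (a rough sketch of the idea, not the actual change in my fork):

    // Sketch of a guarded send path: a throw from the deferred _sendMessage call is
    // converted into a rejection instead of escaping and leaving the promise unsettled.
    function guardedSend(link, message) {
      return new Promise(function (resolve, reject) {
        var sendMessage = function (err) {
          if (err) return reject(err);
          try {
            link._sendMessage(message);   // may throw (explicit throws, stream layer)
            resolve();
          } catch (e) {
            reject(e);                    // without this, the deferred path dangles
          }
        };

        if (!link.canSend()) {
          link._pendingSends.push(sendMessage);   // invoked later by _dispatchPendingSends
          return;
        }

        sendMessage(null);
      });
    }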

Should be testable by having canSend be false and then throwing from sendMessage. I've not looked at the test suite yet so I'm not sure what's involved in setting that up.

jeffmcaffer avatar Feb 27 '17 18:02 jeffmcaffer

@jeffmcaffer it's perhaps confusingly named, but the sendMessage that is pushed to pendingSends is actually this, which includes the logic for what to do wrt the callback policy. dispatchPendingSends in this scenario is called when an attached event is processed, and therefore we should be covered there in expecting it not to throw. Unless I'm missing another potential case here, it seems that the likeliest path for an exception is that canSend is operating on bad assumptions.

I'll try to put a test case together here soon, but if you think you can trigger it in other interesting ways, the unit tests to look at would be here. We have a fairly flexible mock framework for simulating server response frames, etc.

mbroadst avatar Feb 27 '17 19:02 mbroadst

Will try to carve off some testing time this afternoon/evening.

I do still think there is a potential problem with sendMessage(). Consider the pared-down version below for the !canSend() case. The fact that the sendMessage() function is declared in a Promise constructor is immaterial; it has the resolve and reject handlers injected. Now, when the handler() function runs, sendMessage() is called and doSomethingThatCanThrow() runs and throws. In that case, who ever calls resolve or reject? If they never get called, the promise originally given back from send() never settles.

    // resolve and reject are assigned by the enclosing new Promise(...) constructor
    var resolve, reject;
    var sendMessage = function (err) {
      doSomethingThatCanThrow();   // if this throws on the deferred path, nobody settles the promise
      if (goodThingHappens())
        resolve(state);
      else
        reject();
    };

    // !canSend(): defer the send until the link (re)attaches
    if (!self.canSend()) {
      self._pendingSends.push(sendMessage);
      return;
    }

Then some time later the following code runs:

    function handler(err) {
      var sendMessage = this._pendingSends.shift();
      sendMessage(err);   // a throw here escapes; no Promise machinery wraps this call
    }

jeffmcaffer avatar Feb 27 '17 19:02 jeffmcaffer

@jeffmcaffer right, what I'm saying is that given the connection is set up properly, a call to _sendMessage cannot throw (or rather should not throw). We could wrap the subsequent call to the sendMessage wrapper with a try/catch, but it would not be solving the problem you guys are running into imho. So, while I agree that it's probably a safer approach in general, it would have only further hidden the actual problem here.

mbroadst avatar Feb 27 '17 19:02 mbroadst

@jeffmcaffer is it possible to share the log file? What I cannot understand is that you said that most of the time the issue occurs in the attaching state, but if you look at the code of the canSend function, the first line says: when not attached, return false. Normally after attaching you should receive an attached frame, go to the attached state, and then the pending messages are sent. It could be handy to show some sender code to get a better understanding of how to reproduce the issue.

geoffreysamper avatar Feb 27 '17 20:02 geoffreysamper

@geoffreysamper that logging code was removed once we isolated the issue as not being in our project. I'll look for an old file, or we can look at adding it back if it will help. You can see the calling code here. Imagine push() being called many times per second.

More and more I think there are two separate but related issues, and it may be confusing to have them both in the same issue (happy to split the issue if you like):

  1. the exception being thrown (original problem)
  2. promises being lost (initially thought to be directly related but increasingly looking like a separate thing)

My code snippets above were related to the second issue. @mbroadst seems to be on a good path for the first problem.

The point about attaching is very apt wrt the second problem. canSend === false is indeed the case where it seems more likely that promises will be lost. That is because the sendMessage() function is executed outside the context of the Promise infrastructure (i.e., in the attach handling) so there is no one to catch the exception and resolve/reject the promise.

sendMessage can throw as _sendMessage() has several explicit throws and eventually calls down to the stream layer potentially throwing along the way.

One other thing that may be relevant here is that we have many (10s of) worker "loops" running concurrently. Each one gets a message from a queue (one of many), does some stuff, pushes (potentially 100+) messages to a queue, waits on all the promises to settle, and repeats. What we see is that these workers die off. We narrowed it down to the last part of the loop: waiting on the returned send() promises to complete. When things go south, the workers die off slowly (over a minute or two) until none are left and the service stops working, since only a configured number of workers exist.

The lost promise problem happens with just one worker loop as well, but it takes longer, which seems like a load thing. We mitigate it somewhat by rate limiting our calls to send() to 100/s/process. While that seems to help, it does not solve the problem.
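
For context, the worker shape is roughly this (a simplified sketch with made-up names, not the real crawler code), which is why a single unsettled promise permanently parks a worker:

    // Simplified worker loop: if any send() promise never settles, Promise.all never
    // settles either, and this worker never comes back around for another message.
    async function workerLoop(inputQueue, outputQueue, process) {
      while (true) {
        const message = await inputQueue.pop();      // get the next work item
        const results = await process(message);      // does some stuff, yields 100+ messages
        await Promise.all(results.map(m => outputQueue.push(m)));  // push() ends in senderLink.send()
      }
    }

With a fixed number of workers configured, each lost promise permanently removes one of them, which matches the slow die-off we see.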

jeffmcaffer avatar Feb 27 '17 21:02 jeffmcaffer

I found a log. It was pretty verbose so I trimmed it down, put it in a Word doc, and annotated it with comments so you can tell what's what. It does not really show much other than that the problem happens.

ampqlog.docx

jeffmcaffer avatar Feb 28 '17 05:02 jeffmcaffer

@jeffmcaffer question: if I set up the ospo-ghcrawler project and ran it in my environment, would the issue occur there? If so, I will try to set it up in my environment to better understand the problem. Is it correct that the client disconnects after a couple of minutes and you can recover from it? The reason I ask is that I had a similar problem in my project.

geoffreysamper avatar Feb 28 '17 07:02 geoffreysamper

@jeffmcaffer any chance you have the amqp10 logs (DEBUG=amqp* in the environment variables)?

I'm actually in the process of writing some unit tests around all of this, and am discovering the behavior to be that all SenderLink send promises are unconditionally rejected upon reconnect/reattach - a bug indeed, but one that would also indicate that the cases you guys are encountering are almost impossible to reach. Specifically it looks like this:

So it's rather important to figure out exactly how your programs are getting into this state from tracing the logs, because it's not immediately obvious how this is even possible.

Another unrelated bug seems to be that the Session will resetLinkState on the Connection.Disconnected event, whether or not we are reconnecting.

FYI: my wip work is done in the issue295 branch. I won't be able to continue work until later tonight, or tomorrow.

mbroadst avatar Feb 28 '17 16:02 mbroadst

@geoffreysamper In theory you could set up the Crawler but that would likely mean some wasted effort on your part. I think we should be able to create a repro by just sending a ton of messages to a queue. Rather than running the whole system, perhaps just create a queue (OspoCrawler.createAmqp10Queue) connected to Service Bus and hammer it with messages. I can look at that later today or you can give it a go. It will be something like:

    const managementEndpoint = config.get('CRAWLER_SERVICEBUS_MANAGER_ENDPOINT');
    const url = config.get('CRAWLER_AMQP_URL');
    const manager = new ServiceBusQueueManager(url, managementEndpoint);
    const queue = OspoCrawler.createAmqp10Queue(manager, 'test', null, options);
    const promises = [];
    for (let i = 0; i <= 100000; i++) {
      promises.push(queue.push({ type: 'test', url: `http://${i}` }));
    }
    // check the promises, e.g. something like:
    Promise.all(promises).then(() => console.log('all settled'), error => console.log('failed', error));

Does that make sense as a repro approach? If so, I can take a poke at putting that together.

jeffmcaffer avatar Feb 28 '17 17:02 jeffmcaffer

@mbroadst very interesting. Thanks for looking at this. We are definitely, from time to time, seeing the Force Detach errors, BUT I only started seeing them yesterday. That is, we were having all these problems without ever seeing the amqp:link:detach-forced messages.

One thing that I noticed is that if you call send() when !canSend(), then sendMessage is not immediately called; it is added to the pendingSends list and a promise is returned. Note that deferredSender() is not added to the unsettledSends list. So in the flow you described above, forceDetach() does not have a deferredSender to call to settle the promise that was previously returned.

I'll try the repro mentioned in my previous comment and see if messing with canSend() affects this.

Thanks again @mbroadst and @geoffreysamper. Super helpful to have you engaged on this.

jeffmcaffer avatar Feb 28 '17 17:02 jeffmcaffer

@jeffmcaffer I will first try to simulate this with the amqp lib itself, so that I can verify whether the issue is in the amqp lib or in how the amqp lib is used in the project. I had a similar issue in another project: after a couple of minutes I lost my connection and couldn't recover from it.
After some analysis we saw that the memory was always increasing, then we lost the connection and bam, not possible to recover. It will be tomorrow morning; I'm living in the UTC+1 timezone. I will try to simulate it as it is used in the project.

geoffreysamper avatar Feb 28 '17 19:02 geoffreysamper

Thanks @geoffreysamper. I played around a bit this evening. Similar to you, I started with just the AMQP code. To illustrate my scenario (lost promises) I set up a case where the code send()s while !canSend(). Then eventually canSend becomes true and _dispatchPendingSends is called. That triggers the actual send, which throws. In the code below, sender is a configured and connected SenderLink. Indeed, when it is run, the throw comes all the way out and is not caught by any Promise or AMQP infrastructure; neither of the then() args is called. This may be too artificial but it does illustrate what I think is one of the issues at hand.

  // canSend() returns false on the first call (queuing the send) and true after that
  const answers = [false, true];
  sender.canSend = () => {
    return answers.shift();
  };
  // make the underlying send blow up
  sender._sendMessage = () => {
    throw new Error('test');
  };

  sender.send({ test: 'message' }).then(
    () => console.log('pushed'),
    error => console.log('failed'));

jeffmcaffer avatar Mar 01 '17 07:03 jeffmcaffer

@jeffmcaffer Here is a small sample of my app. I will run this in Azure to see if I can simulate the problem. Currently I ran it for an hour, and multiple times I forced the connection to be killed with cports (Windows) or tcpkill (Linux), and the connection still recovers and it sends messages. Can you take a look at the sample so that we are on the same page? In the one hour that it ran, the connection didn't get killed by Azure itself. Can you confirm that you see a lot of reconnections?

geoffreysamper avatar Mar 01 '17 08:03 geoffreysamper

@jeffmcaffer The sample app was running from 10:00 to 16:00 UTC+1. I only received one error, and it was a timeout. I'll try again in Azure.

geoffreysamper avatar Mar 01 '17 14:03 geoffreysamper

The sample is roughly in line with what is going on but does not load the queues enough IMHO. There is only one "loop" and only 100 messages per second, with a 10ms gap between them. In our service there are 10+ loops, each sending 100+ messages quickly (continuous, with 10 concurrent). The key thing here (for the lost promise case) is that we have to get into a !canSend() condition and then have send errors. We need to either pound the service or induce issues.

I've been looking at @mbroadst's tests and sorta see how to set this up, but am not 100% sure how to make the last test ("should send message after reattach") pass; it currently fails. My real goal is to then change it to cover the "send message failure causes the promise to be lost" case.

jeffmcaffer avatar Mar 01 '17 21:03 jeffmcaffer

@jeffmcaffer your pseudo-code above looks pretty good. The last case I added is incomplete because it lacks the second round-trip of connections. I'll try to post an updated case tonight. I think we've identified a few bugs here imho, so I'll try to carve out some serious time towards the end of the week + weekend for all of this. Thanks for your patience and help, both of you.

mbroadst avatar Mar 01 '17 21:03 mbroadst

@geoffreysamper I forked your sample repo to tweak the code to be more aggressive. @willbar (works with me) will also be poking at that and my fork of this repo to see if we can break it repeatably.

Thanks @mbroadst for looking at this. Let us know when there is something to try.

jeffmcaffer avatar Mar 01 '17 22:03 jeffmcaffer

@jeffmcaffer I checked your sample, and if I understand it correctly, with the code below you want to prove that we lose promises? If that is the case, it is normal that completed is less than attempted, because there are still some messages waiting for settlement. If you replace attempted with completed you will see that all the messages are delivered. The only problem then is that you will notice that attempted is higher than 1000, and this is because we have already started a new iteration of sending messages for this loop.

    if (messageCount[loop].attempted % 1000 === 0) {
      console.log(`attempted 1.000 message, completed ${messageCount[loop].completed} to the queue in loop ${loop}`);
    }

I reworked my sample so that it sends 1000 messages for each iteration, and it waits for all the messages of all iterations (loops 0,1,2,3,4,5,6,7,8,9) before it restarts; see . If you run this you will see that no promises are lost.

For the other case, with reconnection: I looked at the code for reconnection and it uses a Fibonacci sequence to calculate the delays before it reconnects. I will investigate further how the program reacts when a connection is killed multiple times; see my repo.

I will try to reproduce this error, and I have a small idea how to do it.

geoffreysamper avatar Mar 02 '17 08:03 geoffreysamper

@jeffmcaffer @mbroadst Hi, I was able to reproduce the problem with the hangs. Basically, what happens here is related to the reconnections; I don't see any promises getting lost.

  • connection opened
  • connection killed
  • reconnect
  • depending on the number of connections lost, the reconnection delay gets longer. The reconnection delays are (0, 1, 1, 2, 3, 5, 8, 13, 21, 34) seconds (see the log below). This means that the first time the connection is lost it will reconnect with a delay of 0, but after enough consecutive kills the wait grows to 34 seconds before the connection is recovered, and during this time there will be pending sends, which leads to long wait times and makes it seem like the application is hanging (a small sketch of this backoff follows this list). You can see everything in my repo. I forked the amqp10 lib and added some extra events.
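
A minimal sketch of that backoff behaviour (my own illustration, not the library code):

    // Fibonacci-style reconnect delays: every consecutive connection loss waits longer
    // before reconnecting, so pending sends sit in the queue for up to tens of seconds
    // and the app looks like it hangs.
    function reconnectDelays(count) {
      const delays = [0, 1000];   // milliseconds
      while (delays.length < count) {
        const n = delays.length;
        delays.push(delays[n - 1] + delays[n - 2]);
      }
      return delays.slice(0, count);
    }

    console.log(reconnectDelays(10));
    // [ 0, 1000, 1000, 2000, 3000, 5000, 8000, 13000, 21000, 34000 ]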

Here is a log that I annotated; look at the higher numbers, e.g. the 21-second wait.

The only thing is that some messages will get the force detached exception, and some will be added to the pending messages if we can't send. Take a look at the timestamps.

@mbroadst maybe it would be an idea to reset the timeouts when the reconnection succeeds.

[11:57:05.522] STARTED interval set to 100
[11:57:06.037] connection opened pending message: 0, rejected messages: 0
[11:57:06.057] connected pending message: 0
[11:57:16.501] connection closed pending message: 100
[11:57:16.501] In 0 wil reconnect and num pending messages: 100

[11:57:17.288] connection opened pending message: 800, rejected messages: 159
[11:57:17.305] connected pending message: 800
[11:57:41.266] connection closed pending message: 100
[11:57:41.267] In 1000 wil reconnect and num pending messages: 100

[11:57:42.781] connection opened pending message: 500, rejected messages: 1066
[11:57:42.797] connected pending message: 500
[11:57:48.394] connection closed pending message: 1400
[11:57:48.394] In 1000 wil reconnect and num pending messages: 1400

[11:57:49.764] connection opened pending message: 400, rejected messages: 2699
[11:57:49.780] connected pending message: 400
[11:58:02.089] connection closed pending message: 1800
[11:58:02.090] In 2000 wil reconnect and num pending messages: 1800

[11:58:05.008] connection opened pending message: 900, rejected messages: 4012
[11:58:05.023] connected pending message: 900
[11:58:11.806] connection closed pending message: 1900
[11:58:11.806] In 3000 wil reconnect and num pending messages: 1900

[11:58:15.212] connection opened pending message: 400, rejected messages: 5015
[11:58:15.228] connected pending message: 400
[11:58:21.405] connection closed pending message: 1700
[11:58:21.405] In 5000 wil reconnect and num pending messages: 1700

[HANGS] for almost between 4 and 5 seconds.

[11:58:26.853] connection opened pending message: 400, rejected messages: 6820
[11:58:26.868] connected pending message: 400
[11:58:44.108] connection closed pending message: 100
[11:58:44.108] In 8000 wil reconnect and num pending messages: 100
[HANGS almost 8 seconds]

[11:58:52.576] connection opened pending message: 410, rejected messages: 8032
[11:58:52.594] connected pending message: 480
[11:59:01.245] connection closed pending message: 100
[11:59:01.245] In 13000 wil reconnect and num pending messages: 100

[HANGS] for 13 seconds

[11:59:14.700] connection opened pending message: 400, rejected messages: 13001
[11:59:14.717] connected pending message: 400
[11:59:39.463] connection closed pending message: 100
[11:59:39.463] In 21000 wil reconnect and num pending messages: 100

[HANGS] for almost  21 seconds

[12:00:01.046] connection opened pending message: 400, rejected messages: 21021
[12:00:01.061] connected pending message: 400
[12:00:32.554] connection closed pending message: 100
[12:00:32.555] In 34000 wil reconnect and num pending messages: 100

[HANGS] for almost 34 seconds

[12:01:07.477] connection opened pending message: 510, rejected messages: 33946
[12:01:07.499] connected pending message: 510
[12:04:36.006] connection closed pending message: 70

geoffreysamper avatar Mar 02 '17 11:03 geoffreysamper

@geoffreysamper Nice. Resetting the timeout sequence on reattach seems like an interesting idea, but I don't think it would address our hangs. Ours are permanent hangs (e.g., it stops at 0200 and is still hung at 0800). One related thing to note: in most/many/all of the hangs we have seen, when the send() is done, the link is in state === 'attaching'. If I understand the code, that means we are actually trying to attach, so the timeout hanging is not at play there.

On the repro code I made, the attempted and completed were not expected to match, for the reasons you mentioned. What should happen though is that the lag should not get particularly big. Not the greatest indicator.

jeffmcaffer avatar Mar 02 '17 21:03 jeffmcaffer

Do you know whether there was any CPU usage, and did you check the memory usage of the Node process?

geoffreysamper avatar Mar 02 '17 21:03 geoffreysamper

Hi @jeffmcaffer @mbroadst, I changed my program a little bit:

  • the interval is set from 100ms to 10ms
  • the payload is now around 9 KB of text

I ran my program on Azure with the interval time set to 10ms, and several times I kept killing the connection via tcpkill (so the connection was opened and immediately closed). I was able to simulate a hang of two minutes, nothing more; the CPU was stuck at 99% and there was no logging during this time, but eventually it recovered. What I could do is set up the ospo-ghcrawler and run it on a machine in Azure with debug logs enabled.

@jeffmcaffer you said something about a rate limiter; I would remove this in my test. Is that a lot of work?

I know this will take some time to set up, but currently for me it is the only way to simulate the problem. What do you both think?

geoffreysamper avatar Mar 03 '17 09:03 geoffreysamper

@jeffmcaffer @geoffreysamper Hey guys. I've been playing around with the code today trying to break it, and in the process of refactoring some of the internals of the base Link module I found an existing test that seemed to reproduce your findings - albeit in a very flaky fashion.

At base, a few things were going on:

  • we were using an (imho) superfluous DETACHING state, which was probably my fault due to a lack of adherence to the state machine (I'm now going to wade through and fix these issues)
  • the check for whether a connection shouldAttemptReattach was happening before the session fully unmapped, but also in the process of unmapping. This is a big no-no, since it will immediately kick off an attempt to reconnect (timeout of 0 when timeouts are generated), even though the rug is about to be pulled out from under it.

I sorted all of this out and now am unable to break the test in question, which makes me wonder if it might help in your cases as well. I've pushed the initial fixes to the issue295 branch, and will continue my refactoring there for the time being, until we can resolve all of your issues.

I want to thank you both for your patience on this issue, these are important edge cases we hadn't run into previously and I greatly value the feedback.

mbroadst avatar Mar 04 '17 18:03 mbroadst

@mbroadst Fantastic news. Thanks Matt. If you have something that you want some mileage on, let me and @willbar know; we have a test setup that we can use to beat on it. Right now in production we are running with my fork, but I'm fine moving that to something new pretty aggressively as well.

jeffmcaffer avatar Mar 04 '17 19:03 jeffmcaffer