stan.net icon indicating copy to clipboard operation
stan.net copied to clipboard

Invalid behavior when using same NATS connection for both NATS and STAN subscriptions

Open dorny opened this issue 3 years ago • 10 comments

I had encountered quite invalid behavior: We have single instance of NATS IConnection. We use it to setup NATS subscriptions and the same instance is also used in IStanConnection (passed via StanOptions.NatsConn).

While setting up STAN subscriptions via IStanSubscription.Subscribe(subject, options, handler) the NATSTimeoutException is thrown. However our handler is still being executed, processing the incoming messages. However as the exception is thrown from the Subscribe(...) method, there is no instance of IStanSubscription we could dispose.

We are encountering this behavior almost all the time during application startup, when most of the subscriptions are configured. However if we add new STAN subscription later, it works as expected.

I tried various things - using locks to serialize access to connection, tested it with or without actual data stored in NATS Streaming (we use FILE store). Nothing helped. Finally I get rid of this behavior once I stopped passing existing IConnection to STAN and let it create it's own connection.

Is reusing of same connection for both NATS and STAN supported? To my understanding, STAN is using NATS just as a transport layer and it should work just fine. If it's not the case, better mention that in a docs

Assuming it should be supported, then:

  1. There is a bug causing the STAN subscription end with NATSTimeoutException even when it's actually established
  2. If exception is throwing from Subscribe(...) method, it shouldn't actually establish the subscription and invoke the provided handler

dorny avatar Aug 31 '20 19:08 dorny

Reusing the same connection is supported - we may have a bug here. I'm not sure why you'd be getting a timeout exception upon establishing a subscription. It is concerning that you have a "partial" subscription, meaning the client didn't clean up.

Which version of the client are you using, and would you be able to reproduce this in a simple app you could share?

ColinSullivan1 avatar Sep 01 '20 17:09 ColinSullivan1

Sure, I will try to reproduce it and post here link to the repo. However as it happens only in one of our services and only during the startup, it's hard to say how easy it will be to reproduce it. Maybe it's a threading issue. We will see.

I'm using the latest version of both NATS and STAN clients.

dorny avatar Sep 01 '20 20:09 dorny

@dorny, We'd appreciate it, it'd really help us out. If you have a full stack trace from the NATSTimeoutException and server logs from the time this happened might shed some light on this.

BTW, I found an issue where the subscription timeout is hardcoded to 2s. That needs to be made configurable. I'll create an issue for that.

ColinSullivan1 avatar Sep 02 '20 15:09 ColinSullivan1

Minimal example: https://gist.github.com/dorny/6255e9a6679a2f582c98e3fa93b91df0

My initial assumption that it's caused by some threading issue was wrong. Bug is reproducible once you use IConnection.RequestAsync(...) method.

Output of the code in gist:

Handler is invoked even if subscribe throws exception!
Bug successfully reproduced!
Timeout occurred.
   at NATS.Client.Connection.requestSync(String subject, Byte[] data, Int32 offset, Int32 count, Int32 timeout)
   at NATS.Client.Connection.request(String subject, Byte[] data, Int32 offset, Int32 count, Int32 timeout)
   at NATS.Client.Connection.Request(String subject, Byte[] data, Int32 timeout)
   at STAN.Client.AsyncSubscription.subscribe(String subRequestSubject, String subject, String qgroup, EventHandler`1 handler)
   at STAN.Client.Connection.subscribe(String subject, String qgroup, EventHandler`1 handler, StanSubscriptionOptions options)
   at STAN.Client.Connection.Subscribe(String subject, StanSubscriptionOptions options, EventHandler`1 handler)
   at StanNetBug.Program.Main(String[] args) in C:\Users\dha\Workspace\KiMES\repos\StanNetBug\Program.cs:line 37

dorny avatar Sep 11 '20 10:09 dorny

Unfortunately your test code (ported to tests) doesn't seem reproduce the error, I'll revisit that to be sure.

Here are my thoughts...

STAN subscriptions are created by:

  1. Creating an inbox to represent the subscription channel (STAN subject)
  2. Create a subscription on that inbox
  3. Make a NATS request to the streaming server to perform a streaming "subscribe"
  4. Wait for the response and process it.

I suspect the problem lies in a timeout on request in step 3, where the NATS streaming server receives the message and starts forwards data, but the response is lost before the client times out. I'll continue to look at this.

Regardless, I'll suggest increasing the timeouts - which is impossible at the moment - so I'll add #185 then we can test to see if a longer timeout resolves the issue for you.

When this happens does the call to STAN.Client.Subscribe take more than 2s?

ColinSullivan1 avatar Sep 25 '20 23:09 ColinSullivan1

I'm pretty sure this is not an performance issue so increasing timeout won't help here. Of course making timeouts configurable would be good, it's just a different topic.

I tried my test code again and the bug is always reproduced. I'm running nats-streaming locally in docker. It's a fresh instance using memory store. No other client is connected. It's running on localhost so there are no network issues as well. I got same behavior also when running nats-streaming as windows service so it's not an docker (WSL on windows) issue.

Subscribe() method call blocks for exactly 2s. Then the handler is invoked and just after that the Timeout exception is thrown. If the line with call to RequestAsync() is commented out, Subscribe() takes 4ms and no exception is thrown.

There must be some bug on at least one side of the connection - this client or NATS Streaming server itself.

dorny avatar Sep 26 '20 13:09 dorny

I'll keep looking... thanks @dorny.

ColinSullivan1 avatar Sep 28 '20 00:09 ColinSullivan1

FYI, outside of xunit I was able to reproduce this with .NET core 3.1. When stepping through things work fine, so likely a timing/threading issue. I noticed .NET core framework itself crash a few times as well, not good.

Wow, you've found a good one... :) I'll let you know what I find.

ColinSullivan1 avatar Sep 30 '20 03:09 ColinSullivan1

Interesting find - if get the task and wait... success.

            //await nc.RequestAsync("rpc.test", dummyMessage);
            var t = nc.RequestAsync("rpc.test", dummyMessage);
            t.Wait();

And this fails...

            var t = nc.RequestAsync("rpc.test", dummyMessage);
            await t;

ColinSullivan1 avatar Sep 30 '20 14:09 ColinSullivan1

I believe I've found the issue. It's a problem that's been resolved in the core NATS library. See: https://github.com/nats-io/nats.net/pull/404. Building with current source and upgrading the core NATS client to 0.11.0-pre2 cleared this up.

Because the new release of the core NATS client will require net 4.6 compatibility, I'll need to coordinate releases for both NATS.Client and STAN.Client releases. I'm planning on getting a release out of the core NATS client soon (hopefully this week). Then I'll get a pre-release of this client so you can verify.

ColinSullivan1 avatar Sep 30 '20 19:09 ColinSullivan1