stan.net
stan.net copied to clipboard
Invalid behavior when using same NATS connection for both NATS and STAN subscriptions
I had encountered quite invalid behavior:
We have single instance of NATS IConnection
. We use it to setup NATS subscriptions and the same instance is also used in IStanConnection
(passed via StanOptions.NatsConn
).
While setting up STAN subscriptions via IStanSubscription.Subscribe(subject, options, handler)
the NATSTimeoutException
is thrown. However our handler
is still being executed, processing the incoming messages. However as the exception is thrown from the Subscribe(...)
method, there is no instance of IStanSubscription
we could dispose.
We are encountering this behavior almost all the time during application startup, when most of the subscriptions are configured. However if we add new STAN subscription later, it works as expected.
I tried various things - using locks to serialize access to connection, tested it with or without actual data stored in NATS Streaming (we use FILE store). Nothing helped.
Finally I get rid of this behavior once I stopped passing existing IConnection
to STAN and let it create it's own connection.
Is reusing of same connection for both NATS and STAN supported? To my understanding, STAN is using NATS just as a transport layer and it should work just fine. If it's not the case, better mention that in a docs
Assuming it should be supported, then:
- There is a bug causing the STAN subscription end with NATSTimeoutException even when it's actually established
- If exception is throwing from
Subscribe(...)
method, it shouldn't actually establish the subscription and invoke the provided handler
Reusing the same connection is supported - we may have a bug here. I'm not sure why you'd be getting a timeout exception upon establishing a subscription. It is concerning that you have a "partial" subscription, meaning the client didn't clean up.
Which version of the client are you using, and would you be able to reproduce this in a simple app you could share?
Sure, I will try to reproduce it and post here link to the repo. However as it happens only in one of our services and only during the startup, it's hard to say how easy it will be to reproduce it. Maybe it's a threading issue. We will see.
I'm using the latest version of both NATS and STAN clients.
@dorny, We'd appreciate it, it'd really help us out. If you have a full stack trace from the NATSTimeoutException
and server logs from the time this happened might shed some light on this.
BTW, I found an issue where the subscription timeout is hardcoded to 2s. That needs to be made configurable. I'll create an issue for that.
Minimal example: https://gist.github.com/dorny/6255e9a6679a2f582c98e3fa93b91df0
My initial assumption that it's caused by some threading issue was wrong.
Bug is reproducible once you use IConnection.RequestAsync(...)
method.
Output of the code in gist:
Handler is invoked even if subscribe throws exception!
Bug successfully reproduced!
Timeout occurred.
at NATS.Client.Connection.requestSync(String subject, Byte[] data, Int32 offset, Int32 count, Int32 timeout)
at NATS.Client.Connection.request(String subject, Byte[] data, Int32 offset, Int32 count, Int32 timeout)
at NATS.Client.Connection.Request(String subject, Byte[] data, Int32 timeout)
at STAN.Client.AsyncSubscription.subscribe(String subRequestSubject, String subject, String qgroup, EventHandler`1 handler)
at STAN.Client.Connection.subscribe(String subject, String qgroup, EventHandler`1 handler, StanSubscriptionOptions options)
at STAN.Client.Connection.Subscribe(String subject, StanSubscriptionOptions options, EventHandler`1 handler)
at StanNetBug.Program.Main(String[] args) in C:\Users\dha\Workspace\KiMES\repos\StanNetBug\Program.cs:line 37
Unfortunately your test code (ported to tests) doesn't seem reproduce the error, I'll revisit that to be sure.
Here are my thoughts...
STAN subscriptions are created by:
- Creating an inbox to represent the subscription channel (STAN subject)
- Create a subscription on that inbox
- Make a NATS request to the streaming server to perform a streaming "subscribe"
- Wait for the response and process it.
I suspect the problem lies in a timeout on request in step 3, where the NATS streaming server receives the message and starts forwards data, but the response is lost before the client times out. I'll continue to look at this.
Regardless, I'll suggest increasing the timeouts - which is impossible at the moment - so I'll add #185 then we can test to see if a longer timeout resolves the issue for you.
When this happens does the call to STAN.Client.Subscribe
take more than 2s?
I'm pretty sure this is not an performance issue so increasing timeout won't help here. Of course making timeouts configurable would be good, it's just a different topic.
I tried my test code again and the bug is always reproduced. I'm running nats-streaming locally in docker. It's a fresh instance using memory store. No other client is connected. It's running on localhost so there are no network issues as well. I got same behavior also when running nats-streaming as windows service so it's not an docker (WSL on windows) issue.
Subscribe()
method call blocks for exactly 2s. Then the handler is invoked and just after that the Timeout exception is thrown.
If the line with call to RequestAsync() is commented out, Subscribe()
takes 4ms and no exception is thrown.
There must be some bug on at least one side of the connection - this client or NATS Streaming server itself.
I'll keep looking... thanks @dorny.
FYI, outside of xunit I was able to reproduce this with .NET core 3.1. When stepping through things work fine, so likely a timing/threading issue. I noticed .NET core framework itself crash a few times as well, not good.
Wow, you've found a good one... :) I'll let you know what I find.
Interesting find - if get the task and wait... success.
//await nc.RequestAsync("rpc.test", dummyMessage);
var t = nc.RequestAsync("rpc.test", dummyMessage);
t.Wait();
And this fails...
var t = nc.RequestAsync("rpc.test", dummyMessage);
await t;
I believe I've found the issue. It's a problem that's been resolved in the core NATS library. See: https://github.com/nats-io/nats.net/pull/404. Building with current source and upgrading the core NATS client to 0.11.0-pre2 cleared this up.
Because the new release of the core NATS client will require net 4.6 compatibility, I'll need to coordinate releases for both NATS.Client and STAN.Client releases. I'm planning on getting a release out of the core NATS client soon (hopefully this week). Then I'll get a pre-release of this client so you can verify.