nats.go
nats.go copied to clipboard
Connect error visibility
The client hides a number of early errors from the user, take the code here:
Client with obvious errors
package main
import(
nats "github.com/nats-io/go-nats"
log "github.com/sirupsen/logrus"
)
func main() {
_, err := nats.Connect("nats://node1, nats://node2",
nats.MaxReconnects(10),
nats.DisconnectHandler(func(nc *nats.Conn) {
log.Warnf("NATS client connection got disconnected: %s", nc.LastError())
}),
nats.ReconnectHandler(func(nc *nats.Conn) {
log.Warnf("NATS client reconnected after a previous disconnection, connected to %s", nc.ConnectedUrl())
}),
nats.ClosedHandler(func(nc *nats.Conn) {
log.Warnf("NATS client connection closed: %s", nc.LastError())
}),
nats.ErrorHandler(func(nc *nats.Conn, sub *nats.Subscription, err error) {
log.Errorf("NATS client on %s encountered an error: %s", nc.ConnectedUrl(), err.Error())
}),
)
if err != nil {
log.Warnf("Connection failed %s", err.Error())
}
}
Here I have a client with obvious errors - no ports on the URLs - this produce simply WARN[0000] Connection failed nats: no servers available for connection
and no callbacks are called.
In reality the error is detected but squashed here https://github.com/nats-io/go-nats/blob/34c8842105ac0b69c838a9998a239d482936c466/nats.go#L1065
If I add a fmt.Println(err.Error())
there I get:
dial tcp: address node2: missing port in address
dial tcp: address node1: missing port in address
So I get it's hard to report this sanely because the API only reports back a single error on nats.Connect
and these are many errors. Generally I think the early Connect
need to provide a lot more visibility. It's impossible for us to provide feedback to users in our logs about which server we're attempting to connect to or why an individual server connection failed - one might be down - it's impossible to log this.
I'd like to be able to log things like:
attempting to connect to nats server pool nats://foo:4222, nats://bar:4222
attempting to connect to nats://foo:4222
connection established to nats://foo:4222
failed to connect to nats://foo:4222 due to <error here>
etc, useful things like this which will help users debug why software I ship is seemingly not working.
@ripienaar Yes, I believe that at like 1066, we should set returnedErr = err
. This would solve the problem where if the last connect to be tried has an error different from connection refused
, then we would return the specific error instead of ErrNoServers
.
The problem I have with the usage of a callback here, is that the NATS connection object from a user perspective is not yet created. So it would be dangerous to have a callback with a *nats.Conn
. The callback would have to only accept an error string, or adding an interface{}
representing user data that would allow the user to differentiate a given "connection" if the callback is reused in several Connect()
calls.
Do you think this callback should be called synchronously in the connect loop? with the danger of the user blocking the connect loop. Would that also makes sense for this callback to return a boolean (or an error) indicating if the connect should be stopped? What do you think?
To me I think if the user pass it - ie. its optional - synchronous is fine, this early on in the life cycle it's effectively all synchronous anyway and subject to all manner of blockages.
So it seems reasonable to me to have a very simple interface that can be used to extract that errors. I do think it's important to be able to get every error as the loop cycles the candidates and also as indicated a before a connection attempt a call back to give useful end user feedback.
The current situation is that should say a TCP SYN packet go unanswered you have timeout * n servers
of just ... nothing. No possible feedback to users in logs or elsewhere. This is surprising and my experience of writing end user tools tells me this is going to result in a ton of support requests. In the past I used another middleware tech and in its library that had the exact same limitation - we added a sync call back based logging during the initial phase - it was exactly like this. I cant comment for everyone but for me that worked great as long as it remains optional.
I also had similar requests open on the ruby NATS client :)
Can you comment on the prototype of such callback? Again, I think that passing the connection is not an option. Would the error just be enough? What about the idea of being able to cancel the connect loop? Is that an overkill?
If we can't pass the connection then I guess we don't have pretty options
-
func(message string, current_server string, err error)
here err can be nil for the happy path -
func(type int, current_server string, err error)
here type is a constant indicating the type of event - success, fail, attempting etc. I imagine if there's some catastrophic error thats not server specificcurrent_server
could be""
but that would be known from the type - much like above but entirely differently named callbacks rather than the type constant. so similar to the current
nats.FooHandler
, make aHandler
per type
None of seem awesome but the 2nd or 3rd would be better since I can supply my own text based on the type of error.
With the stomp gem we ended up with a huge call back list https://github.com/stompgem/stomp/blob/dev/examples/lflogger.rb#L9-L34 of course given the dynamic nature of ruby things are a bit easier there too
Regarding cancel - yes interesting idea, I do not have a use case for this though.
I have other needs that I discussed in the past with @wallyqs in the ruby client project regarding the server list. I like to resolve my server list from SRV records in DNS. Problem is my software is long running - months - and people change DNS. I would like a way to refresh the server list of a client without recreating the whole thing. Think of it like a refresh server list and then you enter the connection retry loop. How this relates to this initial cycle is - I might want to retry DNS lookups at the end of every loop through the server list.
Bump. The error handling could really use some improvements. For example, the same ErrNoServers
error is returned from 7 different places in the code, so there is no straightforward way to tell where the error was thrown.
I'm also running into this issue very early on into trying out nats. I would like to log reconnection attempts (or at the very least failures of this). But it seems I have no way to do this right now, I'm happy to be proven wrong ofcourse