nats.go icon indicating copy to clipboard operation
nats.go copied to clipboard

Connect error visibility

Open ripienaar opened this issue 7 years ago • 8 comments

The client hides a number of early errors from the user, take the code here:

Client with obvious errors
package main

import(
        nats "github.com/nats-io/go-nats"
        log "github.com/sirupsen/logrus"
)

func main() {
        _, err := nats.Connect("nats://node1, nats://node2",
                nats.MaxReconnects(10),
                nats.DisconnectHandler(func(nc *nats.Conn) {
                        log.Warnf("NATS client connection got disconnected: %s", nc.LastError())
                }),
                nats.ReconnectHandler(func(nc *nats.Conn) {
                        log.Warnf("NATS client reconnected after a previous disconnection, connected to %s", nc.ConnectedUrl())
                }),
                nats.ClosedHandler(func(nc *nats.Conn) {
                        log.Warnf("NATS client connection closed: %s", nc.LastError())
                }),
                nats.ErrorHandler(func(nc *nats.Conn, sub *nats.Subscription, err error) {
                        log.Errorf("NATS client on %s encountered an error: %s", nc.ConnectedUrl(), err.Error())
                }),
        )

        if err != nil {
                log.Warnf("Connection failed %s", err.Error())
        }
}

Here I have a client with obvious errors - no ports on the URLs - this produce simply WARN[0000] Connection failed nats: no servers available for connection and no callbacks are called.

In reality the error is detected but squashed here https://github.com/nats-io/go-nats/blob/34c8842105ac0b69c838a9998a239d482936c466/nats.go#L1065

If I add a fmt.Println(err.Error()) there I get:

dial tcp: address node2: missing port in address
dial tcp: address node1: missing port in address

So I get it's hard to report this sanely because the API only reports back a single error on nats.Connect and these are many errors. Generally I think the early Connect need to provide a lot more visibility. It's impossible for us to provide feedback to users in our logs about which server we're attempting to connect to or why an individual server connection failed - one might be down - it's impossible to log this.

I'd like to be able to log things like:

attempting to connect to nats server pool nats://foo:4222, nats://bar:4222 attempting to connect to nats://foo:4222 connection established to nats://foo:4222 failed to connect to nats://foo:4222 due to <error here>

etc, useful things like this which will help users debug why software I ship is seemingly not working.

ripienaar avatar Jul 19 '17 07:07 ripienaar

@ripienaar Yes, I believe that at like 1066, we should set returnedErr = err. This would solve the problem where if the last connect to be tried has an error different from connection refused, then we would return the specific error instead of ErrNoServers.

The problem I have with the usage of a callback here, is that the NATS connection object from a user perspective is not yet created. So it would be dangerous to have a callback with a *nats.Conn. The callback would have to only accept an error string, or adding an interface{} representing user data that would allow the user to differentiate a given "connection" if the callback is reused in several Connect() calls.

Do you think this callback should be called synchronously in the connect loop? with the danger of the user blocking the connect loop. Would that also makes sense for this callback to return a boolean (or an error) indicating if the connect should be stopped? What do you think?

kozlovic avatar Jul 19 '17 16:07 kozlovic

To me I think if the user pass it - ie. its optional - synchronous is fine, this early on in the life cycle it's effectively all synchronous anyway and subject to all manner of blockages.

So it seems reasonable to me to have a very simple interface that can be used to extract that errors. I do think it's important to be able to get every error as the loop cycles the candidates and also as indicated a before a connection attempt a call back to give useful end user feedback.

The current situation is that should say a TCP SYN packet go unanswered you have timeout * n servers of just ... nothing. No possible feedback to users in logs or elsewhere. This is surprising and my experience of writing end user tools tells me this is going to result in a ton of support requests. In the past I used another middleware tech and in its library that had the exact same limitation - we added a sync call back based logging during the initial phase - it was exactly like this. I cant comment for everyone but for me that worked great as long as it remains optional.

I also had similar requests open on the ruby NATS client :)

ripienaar avatar Jul 19 '17 17:07 ripienaar

Can you comment on the prototype of such callback? Again, I think that passing the connection is not an option. Would the error just be enough? What about the idea of being able to cancel the connect loop? Is that an overkill?

kozlovic avatar Jul 19 '17 17:07 kozlovic

If we can't pass the connection then I guess we don't have pretty options

  • func(message string, current_server string, err error) here err can be nil for the happy path
  • func(type int, current_server string, err error) here type is a constant indicating the type of event - success, fail, attempting etc. I imagine if there's some catastrophic error thats not server specific current_server could be "" but that would be known from the type
  • much like above but entirely differently named callbacks rather than the type constant. so similar to the current nats.FooHandler, make a Handler per type

None of seem awesome but the 2nd or 3rd would be better since I can supply my own text based on the type of error.

With the stomp gem we ended up with a huge call back list https://github.com/stompgem/stomp/blob/dev/examples/lflogger.rb#L9-L34 of course given the dynamic nature of ruby things are a bit easier there too

Regarding cancel - yes interesting idea, I do not have a use case for this though.

I have other needs that I discussed in the past with @wallyqs in the ruby client project regarding the server list. I like to resolve my server list from SRV records in DNS. Problem is my software is long running - months - and people change DNS. I would like a way to refresh the server list of a client without recreating the whole thing. Think of it like a refresh server list and then you enter the connection retry loop. How this relates to this initial cycle is - I might want to retry DNS lookups at the end of every loop through the server list.

ripienaar avatar Jul 20 '17 08:07 ripienaar

Bump. The error handling could really use some improvements. For example, the same ErrNoServers error is returned from 7 different places in the code, so there is no straightforward way to tell where the error was thrown.

ArtursL avatar Nov 05 '19 10:11 ArtursL

I'm also running into this issue very early on into trying out nats. I would like to log reconnection attempts (or at the very least failures of this). But it seems I have no way to do this right now, I'm happy to be proven wrong ofcourse

advdv avatar Aug 16 '22 11:08 advdv