
Go cassandra driver (gocql) connection pool failure after Yugabyte rolling restart

Open · justinfx opened this issue 4 years ago • 13 comments

Jira Link: DB-1082 (This issue should be checked after the client is updated to the parent via #9818 but may still be an open issue)

Slack discussion: https://yugabyte-db.slack.com/archives/CG0KQF0GG/p1632856106352700

Related parent gocql project issues:

  • https://github.com/gocql/gocql/issues/1575
  • https://github.com/gocql/gocql/pull/1555
  • https://github.com/gocql/gocql/issues/915

While testing rolling restarts on the Yugabyte Operator cluster, I have observed a problem where a client application ends up with no hosts left in the connection pool, and only a restart of the application resolves the problem:

gocql: no hosts available in the pool

I've done some more testing, and what I have found is that using the individual DNS names seems to be ideal, since they match up with the broadcast_address assigned by the operator to each tserver:

yb-tserver-0.yb-tservers.my-namespace.svc.cluster.local
yb-tserver-1.yb-tservers.my-namespace.svc.cluster.local
yb-tserver-2.yb-tservers.my-namespace.svc.cluster.local

Then if I scale the tservers from 3 -> 2, I will see the connection event logged in my client that one is now timing out (expected). And when I scale the tservers back from 2 -> 3, there is a new IP, but the connection timeout errors for the old IP stop logging, the new one logs as up, and things seem happy. The problem is when the tservers are rolled too quickly, even though I am using the OrderedReady pod management policy to make them restart one at a time. In this case my client completely loses all hosts and doesn't work until a restart. So I feel like I must be missing some kind of client configuration to cope with this situation. Or maybe the yb operator is rolling them too fast? Or maybe I need to be careful and manually roll tservers. But that doesn't help in the situation where the nodes go down for another reason and then come back up with new IPs, and the application client is broken until restart.
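For reference, the scaling and rolling-restart steps described above amount to roughly the following commands; the StatefulSet name yb-tserver is an assumption based on the operator's usual naming, so adjust for your deployment:

kubectl -n <namespace> scale statefulset yb-tserver --replicas=2   # scale tservers 3 -> 2
kubectl -n <namespace> scale statefulset yb-tserver --replicas=3   # scale back 2 -> 3 (new pod, new IP)
kubectl -n <namespace> rollout restart statefulset yb-tserver      # rolling restart of all tservers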

https://github.com/gocql/gocql/issues/1575 refers to a possible issue in the connection pool implementation where hosts are never re-resolved to an IP at a later point in time after the initial connection.

justinfx avatar Oct 04 '21 19:10 justinfx

Thanks. Will try and verify if this issue still persists once the Yugabyte gocql is synced with the latest upstream gocql.

ashetkar avatar Oct 06 '21 06:10 ashetkar

Hi @ashetkar. Thanks for doing the recent gocql driver update! Just wanted to let you know that I gave this reproduction a test with the updated client and I was still able to reproduce the reported issue.

When I scale the tablet servers from 3 to 2, the DOWN state connection messages are reported in the client connection callback, and when I scale back up to 3 it never recovers the new 3rd endpoint IP. It only keeps checking the old IP at intervals.

Even worse, when I scale the tablet servers to 0 and then back up to 3, my application's client can no longer restore its connection pool. It just cycles through UP and DOWN states for the old pod IP addresses and never picks up the new endpoints.

So I would consider this issue really important to our choice to adopt Yugabyte, because I am concerned that any kind of temporary outage would leave clients of the yb cluster broken until restarted. It would seem that currently the only safe option is to run the cluster in a way where the IP addresses are fixed, such as on VMs.

justinfx avatar Oct 25 '21 17:10 justinfx

When I scale the tablet servers from 3 to 2, the DOWN state connection messages are reported in the client connection callback, and when I scale back up to 3 it never recovers the new 3rd endpoint IP. It only keeps checking the old IP at intervals.

I spoke too soon. In this case, after more than 3.5 minutes it finally stops checking the old endpoint IP of the tserver and switches to the new IP with a status of UP. So there must be a fairly long cache somewhere that eventually clears, and it works as long as the whole cluster doesn't go down.

Losing all tservers, however, still results in no recovery.

justinfx avatar Oct 25 '21 18:10 justinfx

Hi @justinfx. On the updated driver, I too could reproduce the issue where bringing the tservers from 3 to 0 throws the “gocql: no hosts available in the pool” message.

But when I tried to re-connect after a sleep of around 30-40 seconds, the app could see the new IP addresses and resume its work. I used your code snippet posted in the Slack channel. This may not be a very elegant workaround, but it did help.

I’m not a k8s expert so not sure if and how one can tweak the time period for which the old IPs are cached.
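For reference, the sleep-then-reconnect workaround amounts to something like this rough sketch; the 40-second delay, package name, and helper name are illustrative, not part of the actual snippet:

package store

import (
    "time"

    "github.com/gocql/gocql"
)

// reconnect closes the old session, waits for Kubernetes to publish the new
// tserver endpoints, and then creates a fresh session from the original
// cluster config so the configured host names get resolved again.
func reconnect(cluster *gocql.ClusterConfig, old *gocql.Session) (*gocql.Session, error) {
    old.Close()
    time.Sleep(40 * time.Second) // rough delay; tune for your environment
    return cluster.CreateSession()
}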

ashetkar avatar Oct 27 '21 12:10 ashetkar

Thanks for starting to look at this, @ashetkar !

For completeness in this ticket, here is the example code snippet I had provided in the Slack channel:

package store // package name assumed; imports added so the snippet is self-contained

import (
    "errors"
    "fmt"
    "net"
    "time"

    "github.com/gocql/gocql"
)

type CQLStore struct {
    cluster *gocql.ClusterConfig
    session *gocql.Session
}

func NewCQLStore(hosts ...string) *CQLStore {
    cluster := gocql.NewCluster(hosts...)
    cluster.Keyspace = "mykeyspace"
    cluster.Timeout = 12 * time.Second
    cluster.ReconnectionPolicy = &gocql.ExponentialReconnectionPolicy{
        MaxRetries:      5,
        InitialInterval: 100 * time.Millisecond,
        MaxInterval:     1 * time.Second,
    }
    
    return &CQLStore{cluster: cluster}
}

func (s *CQLStore) Open() error {
    // resolve host names to ips to ensure that peer membership
    // changes are detected properly (nodes up and down)
    var ips []string
    for _, host := range s.cluster.Hosts {
        resolved, err := net.LookupIP(host)
        if err != nil {
            return fmt.Errorf("error resolving host %q to ip: %w", host, err)
        }
        for _, ip := range resolved {
            if ip.To4() == nil {
                continue
            }
            ips = append(ips, ip.String())
        }
    }
    s.cluster.Hosts = ips

    var err error
    s.session, err = s.cluster.CreateSession()
    return err
}

func (s *CQLStore) Close() error {
    s.session.Close()
    if !s.session.Closed() {
        return errors.New("Could not close session")
    }
    return nil
}

So to be clear, at this point, and based on some of the responses in the parent gocql tickets, I have stopped manually resolving hosts to IPs when I configure the driver, and am still using the pod DNS names:

yb-tserver-0.yb-tservers.my-namespace.svc.cluster.local
yb-tserver-1.yb-tservers.my-namespace.svc.cluster.local
yb-tserver-2.yb-tservers.my-namespace.svc.cluster.local

I’m not a k8s expert so not sure if and how one can tweak the time period for which the old IPs are cached.

I'm not convinced it is a k8s issue here, because I can run this and immediately see the IPs changing as the tserver pods are scaled down and up:

kubectl -n <namespace> get endpoints yb-tservers -o jsonpath='{.subsets[*].addresses[*].ip}'

I can run that while I am scaling and watching my application log, to compare the connection callback messages with the IPs that the driver is trying to use. It looks pretty clear that while the new tserver has a properly published new IP, the gocql driver still holds the old resolved IP that it keeps trying. And when all tservers have been restarted, it is always just looking at that old set of IPs. It would seem that the driver never goes back to the originally configured host names it was given and resolves them again.
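To keep that comparison running continuously, the same check works in watch mode (standard kubectl, nothing Yugabyte-specific):

kubectl -n <namespace> get endpoints yb-tservers -o wide -w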

But when I tried to re-connect after a sleep of around 30-40 seconds, the app could see the new IP addresses and resume its work....This may not be a very elegant workaround, but it did help.

When you say you did a 're-connect', what does that entail? Does it mean you restarted the application? Or does it mean that, in code, you close and re-open the client driver? Restarting feels extreme, because it implies that I would need to monitor when this problem is occurring and manage a restart of the application nodes. And the programmatic close/open of the client driver seems like a workaround for something the driver should actually be doing itself, right? There must be some point, based on the driver configuration options, at which, after enough retries and timeouts with the entire connection pool unavailable, it would automatically flush the hosts and re-resolve.

What is your opinion on the current behavior of the client driver in this case where the connection pool becomes unusable? Do you think the application should implement some kind of mitigation by observing the connection failure events and recycling the client? Is this caused by https://github.com/gocql/gocql/issues/915?

justinfx avatar Oct 27 '21 18:10 justinfx

Yes, I was using the above code to explicitly resolve dns names to ip addresses. And by 're-connect' I meant recreating the session (and not restarting the application itself).

Agree that this is something better handled at the driver layer itself rather than in the application. This issue is still seen despite a fix for https://github.com/gocql/gocql/issues/915 (that issue is still open).

Looking into the code to better understand it. Will update soon.

ashetkar avatar Oct 29 '21 19:10 ashetkar

Hi @ashetkar. Since this is still an open issue and I need to ensure my application has a stable connection to the cluster, I am going to have to apply the client-side workaround for now. I've been looking at the various Observer hooks in the client and I can see when there are ObserveConnect errors with a DOWN state and timeout. But what I am really trying to hook into is the "no hosts available in the pool" error. Do you have advice on the best place to determine that all hosts have been exhausted in the pool so I can trigger a reconnect on the client? It seems like my two options are:

  1. Watch ObserveConnect errors for enough DOWN states and then check the host pool manually?
  2. Wrap the client to intercept query errors, or check every error directly for gocql.ErrNoConnections (a rough sketch of this option follows below)
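For illustration, option 2 could look roughly like the sketch below, building on the CQLStore snippet from my earlier comment. The Reopen and execWithRecycle helpers are hypothetical names, not gocql APIs, and the sketch assumes cluster.Hosts still holds the DNS names (i.e., the manual IP-resolution step is skipped) so a new session re-resolves them:

package store

import (
    "context"
    "errors"

    "github.com/gocql/gocql"
)

// Reopen (hypothetical) closes the current session and creates a new one
// from the cluster config, so the configured host names are resolved again.
func (s *CQLStore) Reopen() error {
    if s.session != nil {
        s.session.Close()
    }
    sess, err := s.cluster.CreateSession()
    if err != nil {
        return err
    }
    s.session = sess
    return nil
}

// execWithRecycle (hypothetical) runs a statement and, if the host pool is
// exhausted, recycles the session once and retries the statement.
func (s *CQLStore) execWithRecycle(ctx context.Context, stmt string, args ...interface{}) error {
    err := s.session.Query(stmt, args...).WithContext(ctx).Exec()
    if !errors.Is(err, gocql.ErrNoConnections) {
        return err
    }
    // "no hosts available in the pool": replace the session so the
    // originally configured hosts are resolved and connected again.
    if rerr := s.Reopen(); rerr != nil {
        return rerr
    }
    return s.session.Query(stmt, args...).WithContext(ctx).Exec()
}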

justinfx avatar Nov 22 '21 20:11 justinfx

Thought I would update this ticket with the details of how far I had to go in order to work around this issue on the client application side of things...

Because the application using the client driver doesn't have enough access to the internal state of the connection pool, I had to do the best I could with a ConnectObserver that watches the host events. The high-level logic goes something like this (a rough sketch follows the list):

  • Open the client driver by installing a custom ConnectObserver, and create the Session. Access to the session must be guarded by a mutex (RWMutex).
  • All access to the session must take a read lock, and the session needs to be "checked out" as a local variable before use. After using the session, it has to be checked back in under the lock, and we determine whether it has been replaced in the meantime. If it was replaced, we close the session we checked out.
  • The ConnectObserver maintains a map of down hosts, the timestamp of the last reconnect, and a flag for whether it is currently reconnecting.
    • For every error event that also contains a connect address...
      • if the host is "up", remove it from the map
      • if the host is "down", add it to the map
      • If there is only one configured host and it is down, OR if more than 2 of the configured hosts are down:
        • check that we have not reconnected in the last 10 seconds
        • check that we are not concurrently reconnecting (concurrent access to the client driver)
        • acquire the session lock, close the session, and create a new one
        • update the last reconnect time
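For illustration, the observer part of that logic might look roughly like the sketch below. This is not the exact code I'm running; the type name and the injected reopen callback (something that closes the old session and creates a new one from the originally configured hosts, under the store's lock) are assumptions:

package store

import (
    "sync"
    "time"

    "github.com/gocql/gocql"
)

// reconnectObserver watches connection events and triggers a session
// recycle when it looks like the pool has lost its hosts.
type reconnectObserver struct {
    reopen func() error // assumed: closes the old session and creates a new one under the store's lock

    mu            sync.Mutex
    downHosts     map[string]struct{}
    lastReconnect time.Time
    reconnecting  bool
}

func (o *reconnectObserver) ObserveConnect(c gocql.ObservedConnect) {
    addr := c.Host.ConnectAddress()
    if len(addr) == 0 {
        return // nothing to track without a connect address
    }
    key := addr.String()

    o.mu.Lock()
    defer o.mu.Unlock()

    if c.Err == nil {
        delete(o.downHosts, key) // host is up: forget it
        return
    }
    o.downHosts[key] = struct{}{} // host is down: remember it

    tooManyDown := len(o.downHosts) > 2 // or: the single configured host is down
    recently := time.Since(o.lastReconnect) < 10*time.Second
    if tooManyDown && !recently && !o.reconnecting {
        o.reconnecting = true
        go func() {
            _ = o.reopen()
            o.mu.Lock()
            o.lastReconnect = time.Now()
            o.reconnecting = false
            o.mu.Unlock()
        }()
    }
}

The observer gets installed on the cluster config before the session is created, e.g. cluster.ConnectObserver = &reconnectObserver{reopen: someReopenFunc, downHosts: map[string]struct{}{}}, where someReopenFunc is a lock-guarded variant of the session-replacement logic described above.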

The solution is not perfect because it still results in queries hanging a bit while this detection logic plays out. And I had to make sure to handle the fact that I can only "reconnect" by replacing the session, which is in concurrent use. So I have to be careful to guard the access to the session and replace it with a new one without interfering with a query that is already in progress.

Ideally this would be handled more accurately within the client driver's pool and would not require replacing the session object. The driver needs to be smarter about knowing when the host pool is exhausted, so it can re-resolve the originally configured DNS host list and update the pool again. It should be as transparent as possible for concurrent consumers of the session.

justinfx avatar Dec 14 '21 17:12 justinfx

@ashetkar has there been any progress on this?

justinfx avatar Oct 14 '22 23:10 justinfx

@ashetkar this has been reported as fixed in gocql 1.4.0 via https://github.com/gocql/gocql/pull/1682

Any chance the yugabyte go client could be updated for these fixes?

justinfx avatar May 07 '23 01:05 justinfx

@justinfx Thanks for pointing it out to us. Yes, we need to update the driver to the latest upstream master anyway, because it has quite a sizeable number of commits since we last updated. Let me come back on this in a day or two with an approximate timeline.

ashetkar avatar May 09 '23 02:05 ashetkar

Sorry for the delay in responding. We are targeting this for the next month. Will keep this ticket updated, thanks.

ashetkar avatar May 26 '23 05:05 ashetkar

We upgraded the YB gocql driver to upstream gocql version 1.6.0 in January this year. The fix mentioned above, released in version 1.4.0 of upstream gocql, should now be available in this fork as well.

ashetkar avatar May 08 '24 11:05 ashetkar