Query retries rarely happen because the host pool policy aborts query execution
So I don't know whether this works as designed or not. At the least I found the behaviour counter-intuitive and spent quite some time troubleshooting what was going on.
In my client app I've implemented a config option to enable query retries, which simply enables a SimpleRetryPolicy within gocql. I then use toxiproxy to artificially slow down Cassandra, somewhat randomly, so that some queries are barely slowed down while others become slower than the timeout. Note that in my testing I use just one Cassandra instance.
As I tested this with higher and higher query retry settings, I noticed that only some queries are retried, and only a limited number of times, definitely not up to the limit permitted by the SimpleRetryPolicy. So they still return timeout errors, even though they could have been retried more often and returned a valid result instead.
The reason is that queryExecutor.executeQuery(qry ExecutableQuery) stops attempting a query when the host pool's Pick() function returns nil instead of a valid host.
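That abort can be modelled with a minimal, self-contained sketch (the types and names below are illustrative stand-ins, not gocql's actual internals): once the iterator is exhausted and yields nil, the loop returns immediately, no matter how many retries the policy would still allow.

```go
package main

import "fmt"

// Illustrative stand-ins for gocql's internals; none of these names are real.
type host struct{ addr string }

// nextHost mimics a host-pool iterator: it yields each host once, then nil.
type nextHost func() *host

// pickOnce builds an iterator over a fixed host list, much as a host
// selection policy's Pick() might.
func pickOnce(hosts []*host) nextHost {
	i := 0
	return func() *host {
		if i >= len(hosts) {
			return nil // pool exhausted
		}
		h := hosts[i]
		i++
		return h
	}
}

// executeQuery models the loop described above: every attempt is assumed to
// time out, and the loop aborts as soon as the iterator yields nil, even if
// the retry policy would permit more attempts.
func executeQuery(iter nextHost, maxRetries int) (attempts int) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if h := iter(); h == nil {
			return attempts // aborted: no host, retry budget left unused
		}
		attempts++
	}
	return attempts
}

func main() {
	// Single-node cluster with NumRetries = 3: only one attempt happens.
	iter := pickOnce([]*host{{addr: "10.0.0.1"}})
	fmt.Println(executeQuery(iter, 3)) // prints 1
}
```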
I thought I had read somewhere that when a host pool becomes empty (e.g. all hosts timed out), the hosts would all be re-added, so that the query could be retried again. But this does not seem to be the case.
I verified this with both the round-robin host selection policy and TokenAware with a simple host pool backend. (I know TokenAware doesn't affect anything here because I have one node; presumably this is equivalent to using hostpool-simple directly.)
With NumRetries 3/4/5/6/... and round-robin, I saw that many queries were only tried twice (retried once) and timed out. With TokenAware + hostpool-simple, only two queries were tried twice, resulting in a timeout and "too many timeouts" errors; all queries afterwards were only tried once.
Is this normal?
Yep, I realised whilst refactoring to add the queryExecutor that the retry policy interface is severely limiting, which is why I created #735.
Some ideas spring to mind:
- Eagerly consume from the host iterator.
- Improve the retry policy interface to return the next host for a given error + query.
- Make the retry policy take a list of hosts or a HostIter, then have it return a query plan with methods like:
    type RetryPolicy interface {
        QueryPlan(hosts []HostInfo) QueryPlan
    }

    type QueryPlan interface {
        NextHost(err error) (HostInfo, bool)
    }
The query plan would then have a view of all the hosts it can return for a query; it can decide whether the query should be tried again on the same host, the same rack, the same DC, or cross-DC, or whether the error should not be retried at all.
Something I can see happening is host priority: the host selector returns a list of hosts in some priority order, and the retry policy needs to respect that as well.
Isn't it a problem that by the time a query needs to be retried, the host priority list may be out of date? Hosts may have gone down, been marked slow, etc. Maybe it needs to refresh the list on every try? It seems like host selection and retry policy are very intertwined matters; maybe we should just merge them into one thing. E.g. the host policies could contain the code for how/when to retry queries.
True, but that greatly complicates the policies and makes them much harder to implement. I don't think it's entirely a problem: if the retry policy is passed a []*HostInfo, then it will know if a host is down due to the shared HostInfo, but it won't get hosts that come up. Potentially we could return downed hosts from the host selector, which can be passed through.
I don't entirely see it being a problem, as query retry will be taking place in sub-50ms.
> if the retry policy is passed a []*HostInfo then it will know if a host is down due to the shared HostInfo, but won't get hosts that are up. Potentially we could return downed hosts from the host selector which can be passed through.
But host selection policies can be more fine-grained AFAIK and use a more nuanced criterion such as latency, rather than binary up/down. Isn't the epsilon-greedy hostpool an example of this? AFAIK it may adjust its preferences at every single query (with hundreds of QPS, that means every few ms). Also, if a host was dead at first but then became live again and a connection was re-established, I would want to incorporate it into the retries.
> I don't entirely see it being a problem as query retry will be taking place in sub 50ms
Maybe, if all your queries return within 50 ms. We definitely see longer-running queries that take >1000 ms, and in fact the reason for retries in our environment is usually hitting our timeout limit. We're still experimenting with the various settings, but at this point we have timeouts of 2000~3000 ms and our queries are hitting that (unfortunately).
@Dieterbe, could you please comment on the issue? What is the status of it? There have been a lot of changes in the driver since the issue was created. Is it still relevant for you?
We migrated away from Cassandra a few years ago, so we haven't used gocql since then.
We've also encountered this problem when implementing unlimited retries to Cassandra, required by our business logic. Enabling retries alone is not enough, because the hosts are traversed only once (e.g., here).
You can enable retries without changing the queried host. That way you are limited only by your RetryPolicy options, such as the number of attempts. But obviously this is not as reliable.
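For reference, that amounts to just setting the retry policy on the cluster config (a sketch; the contact point is a placeholder):

```go
cluster := gocql.NewCluster("127.0.0.1")
// SimpleRetryPolicy retries the query up to NumRetries times; on its own
// it does not influence which host is picked next.
cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 3}
```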
We've implemented a wrapper for HostSelectionPolicy that waits for available hosts forever (though, sure, you can change that). Not the best implementation, but it works with any host selection policy.
It can be used, for instance, like this:

    HostSelectionPolicy: &EndlessHostSelectionPolicy{gocql.RoundRobinHostPolicy()}
And the implementation:
    type EndlessHostSelectionPolicy struct {
        HostPolicy gocql.HostSelectionPolicy
    }

    var _ gocql.HostSelectionPolicy = (*EndlessHostSelectionPolicy)(nil)

    // Pick wraps the inner policy's iterator: whenever it is exhausted and
    // returns nil, re-Pick from the inner policy and keep trying until a
    // host becomes available.
    func (p *EndlessHostSelectionPolicy) Pick(q gocql.ExecutableQuery) gocql.NextHost {
        hostIter := p.HostPolicy.Pick(q)
        return func() gocql.SelectedHost {
            host := hostIter()
            for host == nil {
                time.Sleep(time.Second) // without the sleep this becomes active waiting
                hostIter = p.HostPolicy.Pick(q)
                host = hostIter()
            }
            return host
        }
    }

    // The remaining methods simply delegate to the wrapped policy.

    func (p *EndlessHostSelectionPolicy) IsLocal(host *gocql.HostInfo) bool {
        return p.HostPolicy.IsLocal(host)
    }

    func (p *EndlessHostSelectionPolicy) Init(session *gocql.Session) {
        p.HostPolicy.Init(session)
    }

    func (p *EndlessHostSelectionPolicy) KeyspaceChanged(ev gocql.KeyspaceUpdateEvent) {
        p.HostPolicy.KeyspaceChanged(ev)
    }

    func (p *EndlessHostSelectionPolicy) SetPartitioner(partitioner string) {
        p.HostPolicy.SetPartitioner(partitioner)
    }

    func (p *EndlessHostSelectionPolicy) AddHost(host *gocql.HostInfo) {
        p.HostPolicy.AddHost(host)
    }

    func (p *EndlessHostSelectionPolicy) RemoveHost(host *gocql.HostInfo) {
        p.HostPolicy.RemoveHost(host)
    }

    func (p *EndlessHostSelectionPolicy) HostUp(host *gocql.HostInfo) {
        p.HostPolicy.HostUp(host)
    }

    func (p *EndlessHostSelectionPolicy) HostDown(host *gocql.HostInfo) {
        p.HostPolicy.HostDown(host)
    }