Query retries rarely happen because the host pool policy aborts query execution
So I don't know whether this works as designed or not. At the least I found the behaviour counter-intuitive and spent quite some time troubleshooting what was going on.
In my client app I've implemented a config option to enable query retries, which simply enables a SimpleRetryPolicy within gocql. I then use toxiproxy to artificially slow down Cassandra, somewhat randomly, so that some queries are barely slowed down while others become slower than the timeout. Note that in my testing I use just one Cassandra instance.
As I tested this with higher and higher query retry settings, I noticed that only some queries are retried, and only a limited number of times, definitely not up to the limit permitted by the SimpleRetryPolicy. So they still return timeout errors, even though they could have been retried more often and returned a valid result instead.
The reason is that queryExecutor.executeQuery(qry ExecutableQuery) stops attempting a query when the host pool's Pick() function returns nil instead of a valid host.
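That abort can be modelled with a minimal, self-contained sketch (the types and names below are illustrative stand-ins, not gocql's actual internals): once the iterator is exhausted and yields nil, the loop returns immediately, no matter how many retries the policy would still allow.

```go
package main

import "fmt"

// Illustrative stand-ins for gocql's internals; none of these names are real.
type host struct{ addr string }

// nextHost mimics a host-pool iterator: it yields each host once, then nil.
type nextHost func() *host

// pickOnce builds an iterator over a fixed host list, much as a host
// selection policy's Pick() might.
func pickOnce(hosts []*host) nextHost {
	i := 0
	return func() *host {
		if i >= len(hosts) {
			return nil // pool exhausted
		}
		h := hosts[i]
		i++
		return h
	}
}

// executeQuery models the loop described above: every attempt is assumed to
// time out, and the loop aborts as soon as the iterator yields nil, even if
// the retry policy would permit more attempts.
func executeQuery(iter nextHost, maxRetries int) (attempts int) {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if h := iter(); h == nil {
			return attempts // aborted: no host, retry budget left unused
		}
		attempts++
	}
	return attempts
}

func main() {
	// Single-node cluster with NumRetries = 3: only one attempt happens.
	iter := pickOnce([]*host{{addr: "10.0.0.1"}})
	fmt.Println(executeQuery(iter, 3)) // prints 1
}
```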
I thought I had read somewhere that when a host pool becomes empty (e.g. all hosts timed out), the hosts would all be re-added, so that the query could be retried again. But this does not seem to be the case.
I verified this with both the round-robin host selection policy and TokenAware with a simple host pool backend. (I know TokenAware doesn't affect anything here because I have one node; presumably this is equivalent to using hostpool-simple directly.)
With NumRetries 3/4/5/6/... and round-robin, I saw that many queries were only tried twice (retried once) and timed out. With TokenAware + hostpool-simple, only two queries were tried twice, resulting in a timeout and "too many timeouts" errors; all queries afterwards were only tried once.
Is this normal?
Yep, I realised whilst refactoring to add the queryExecutor that the retry policy interface is severely limiting, which is why I created #735.
Some ideas spring to mind:
- Eagerly consume from the host iterator.
- Improve the retry policy interface to return the next host for a given error + query.
- Make the retry policy take a list of hosts or a HostIter, then have it return a query plan with methods like:
    type RetryPolicy interface {
        QueryPlan(hosts []HostInfo) QueryPlan
    }

    type QueryPlan interface {
        NextHost(err error) (HostInfo, bool)
    }
The query plan would then have a view of all the hosts it can return for a query; it can decide whether the query should be tried again on the same host, the same rack, the same DC, or cross-DC, or whether the error should not be retried at all.
Something I can see happening is host priority: the host selector returns a list of hosts in some priority order, and the retry policy needs to respect that as well.
Isn't it a problem that by the time a query needs to be retried, the host priority list may be out of date? Hosts may have gone down, been marked slow, etc. Maybe it needs to refresh the list on every try? It seems like host selection and retry policy are very intertwined matters; maybe we should just merge them into one thing. E.g. the host policies could contain the code for how/when to retry queries.
True, but that greatly complicates the policies and makes them much harder to implement. I don't think it's entirely a problem: if the retry policy is passed a []*HostInfo, then it will know if a host is down due to the shared HostInfo, but it won't get hosts that come up. Potentially we could return downed hosts from the host selector, which can be passed through.
I don't entirely see it being a problem, as query retry will be taking place in sub-50ms.
> if the retry policy is passed a []*HostInfo then it will know if a host is down due to the shared HostInfo, but won't get hosts that are up. Potentially we could return downed hosts from the host selector which can be passed through.
But host selection policies can be more fine-grained AFAIK and use a more nuanced criterion such as latency, rather than binary up/down. Isn't the epsilon-greedy hostpool an example of this? AFAIK it may adjust its preferences at every single query (with hundreds of QPS, that means every few ms). Also, if a host was dead at first but then became live again and a connection was re-established, I would want to incorporate it into the retries.
> I don't entirely see it being a problem as query retry will be taking place in sub 50ms
Maybe, if all your queries return within 50 ms. We definitely see longer-running queries that take >1000 ms, and in fact the reason for retries in our environment is usually hitting our timeout limit. We're still experimenting with the various settings, but at this point we have timeouts of 2000~3000 ms and our queries are hitting that (unfortunately).
@Dieterbe, could you please comment on the issue? What is the status of it? There have been a lot of changes in the driver since the issue was created. Is it still relevant for you?
We migrated away from Cassandra a few years ago, so we haven't used gocql since then.
We've also encountered this problem when implementing unlimited retries to Cassandra, required by our business logic. Enabling retries alone is not enough, because the hosts are traversed only once (e.g., here).
You can enable retries without changing the queried host. That way you are limited only by your RetryPolicy options, such as the number of attempts. But obviously this is not as reliable.
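For reference, that amounts to just setting the retry policy on the cluster config (a sketch; the contact point is a placeholder):

```go
cluster := gocql.NewCluster("127.0.0.1")
// SimpleRetryPolicy retries the query up to NumRetries times; on its own
// it does not influence which host is picked next.
cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 3}
```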
We've implemented a wrapper for HostSelectionPolicy that waits for available hosts forever (though, sure, you can change that). Not the best implementation, but it works with any host selection policy.
It can be used, for instance, like this:

    HostSelectionPolicy: &EndlessHostSelectionPolicy{gocql.RoundRobinHostPolicy()}
And the implementation:
    type EndlessHostSelectionPolicy struct {
        HostPolicy gocql.HostSelectionPolicy
    }

    var _ gocql.HostSelectionPolicy = (*EndlessHostSelectionPolicy)(nil)

    // Pick wraps the inner policy's iterator: whenever it is exhausted and
    // returns nil, re-Pick from the inner policy and keep trying until a
    // host becomes available.
    func (p *EndlessHostSelectionPolicy) Pick(q gocql.ExecutableQuery) gocql.NextHost {
        hostIter := p.HostPolicy.Pick(q)
        return func() gocql.SelectedHost {
            host := hostIter()
            for host == nil {
                time.Sleep(time.Second) // without the sleep this becomes active waiting
                hostIter = p.HostPolicy.Pick(q)
                host = hostIter()
            }
            return host
        }
    }

    // The remaining methods simply delegate to the wrapped policy.

    func (p *EndlessHostSelectionPolicy) IsLocal(host *gocql.HostInfo) bool {
        return p.HostPolicy.IsLocal(host)
    }

    func (p *EndlessHostSelectionPolicy) Init(session *gocql.Session) {
        p.HostPolicy.Init(session)
    }

    func (p *EndlessHostSelectionPolicy) KeyspaceChanged(ev gocql.KeyspaceUpdateEvent) {
        p.HostPolicy.KeyspaceChanged(ev)
    }

    func (p *EndlessHostSelectionPolicy) SetPartitioner(partitioner string) {
        p.HostPolicy.SetPartitioner(partitioner)
    }

    func (p *EndlessHostSelectionPolicy) AddHost(host *gocql.HostInfo) {
        p.HostPolicy.AddHost(host)
    }

    func (p *EndlessHostSelectionPolicy) RemoveHost(host *gocql.HostInfo) {
        p.HostPolicy.RemoveHost(host)
    }

    func (p *EndlessHostSelectionPolicy) HostUp(host *gocql.HostInfo) {
        p.HostPolicy.HostUp(host)
    }

    func (p *EndlessHostSelectionPolicy) HostDown(host *gocql.HostInfo) {
        p.HostPolicy.HostDown(host)
    }