featurebase
Queries interfere with gossip, causing cluster to die
For bugs, please provide the following:
What's going wrong?
It seems that when we start querying the cluster, the queries contend for resources with the nodes' ability to answer gossip probes, and the cluster jumps from NORMAL to STARTING and back frequently, causing queries to error out. It is not caused by the network being a bottleneck: traffic is low, around 50 KB/s, which is nothing for the available capacity.
What was expected?
Queries should not interfere with the cluster's ability to communicate. I think gossip communication and the cluster's ability to stay alive should come first when distributing available resources; it would be nice to have a handle on this.
Steps to reproduce the behavior
My gossip config is:
[gossip]
key = ""
port = "14000"
probe-interval = "4s"
probe-timeout = "2s"
push-pull-interval = "30s"
seeds = ["10.8.222.185:14000","10.8.249.29:14000"]
stream-timeout = "10s"
suspicion-mult = 4
to-the-dead-time = "30s"
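For reference, these knobs look like they correspond to hashicorp/memberlist settings (Pilosa's gossip is built on memberlist). Here is a minimal sketch of how they would be applied to a memberlist.Config, assuming a one-to-one mapping of names; the mapping is my guess, not taken from Pilosa's source:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Assumed mapping of the [gossip] section above onto memberlist's Config.
	conf := memberlist.DefaultLANConfig()
	conf.BindPort = 14000                       // port
	conf.ProbeInterval = 4 * time.Second        // probe-interval
	conf.ProbeTimeout = 2 * time.Second         // probe-timeout
	conf.PushPullInterval = 30 * time.Second    // push-pull-interval
	conf.SuspicionMult = 4                      // suspicion-mult
	conf.GossipToTheDeadTime = 30 * time.Second // to-the-dead-time
	conf.TCPTimeout = 10 * time.Second          // stream-timeout (assumed)

	list, err := memberlist.Create(conf)
	if err != nil {
		log.Fatal(err)
	}
	// seeds
	if _, err := list.Join([]string{"10.8.222.185:14000", "10.8.249.29:14000"}); err != nil {
		log.Fatal(err)
	}
}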
Included merged logs from my 7 boxes (queries started at 10:58; issues with gossip appear immediately, the cluster starts to jump back and forth between states, and this behavior persists until the query load stops):
Information about your environment (OS/architecture, CPU, RAM, cluster/solo, configuration, etc.)
commit - https://github.com/pilosa/pilosa/commit/5bcb00e11a746a988dfbca3f5b8c55d3c21fbd54
Yeah, we definitely need to figure out why this happens... it's a bit strange. You might try increasing suspicion-mult, which will give memberlist a bit more time to refute dropped nodes and hopefully avoid the cluster changing state.
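For a sense of scale: if I remember memberlist's logic right, the suspicion timeout works out to roughly suspicion-mult * max(1, log10(cluster size)) * probe-interval, so with the config above a suspected node only gets about 16s to refute before being declared dead. A small sketch of that calculation (my reconstruction, not memberlist's actual code):

package main

import (
	"fmt"
	"math"
	"time"
)

// Rough reconstruction of memberlist's suspicion timeout:
// suspicionMult * max(1, log10(n)) * probeInterval.
// Exact behavior may differ between memberlist versions.
func suspicionTimeout(suspicionMult, n int, probeInterval time.Duration) time.Duration {
	nodeScale := math.Max(1.0, math.Log10(math.Max(1.0, float64(n))))
	return time.Duration(float64(suspicionMult) * nodeScale * float64(probeInterval))
}

func main() {
	// 7 nodes, suspicion-mult = 4, probe-interval = 4s -> 16s to refute.
	fmt.Println(suspicionTimeout(4, 7, 4*time.Second))
	// Bumping suspicion-mult to 6 stretches that to 24s.
	fmt.Println(suspicionTimeout(6, 7, 4*time.Second))
}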
I'm going to hypothesize, though, that there is some kind of blocking operation happening that stops gossip from being able to share state while queries are going on. I'll look into this.
@dmibor would love to know if this issue has still been affecting you all and if the latest changes fix/improve it. It's been a fairly complex thing to track down, and it seems like there are multiple paths for causing it.
@jaffee Hey! Will definitely keep an eye on that one starting next month, when we install master again.
P.S.: sorry for the silence for a while - I was on a month-long vacation.
@jaffee I see the same behavior from time to time after installing master ( https://github.com/pilosa/pilosa/commit/a5aa6e48a50ab34cd3a1136147c12fb145f95b14 ) - a heavily loaded node loses connectivity and kills the cluster for up to 10 seconds. Hard to say whether it improved or not, given the sporadic nature of the thing... Probably improved, but not completely fixed. On top of that, in addition to the "cluster in STARTING state" error response, I have started to see this:
400 Bad Request - {"error":"executing: starting mapper: shards by node: shard unavailable"}
The query itself is fine, since a retry of the same query works with no issues.
@jaffee Can we reopen this? The issue is still there; we keep seeing it periodically. Right now we work around it by putting a ton of retries into everything that queries Pilosa frequently; it would be nice if that were no longer needed...
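For what it's worth, our workaround looks roughly like the sketch below - a hypothetical retry helper (retryQuery is my own name, not a client API) that retries only on the transient cluster-state errors quoted above:

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// retryQuery retries a query whenever the cluster reports a transient
// availability error, with simple exponential backoff.
func retryQuery(attempts int, backoff time.Duration, query func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = query(); err == nil {
			return nil
		}
		// Treat only the transient cluster-state errors as retryable.
		msg := err.Error()
		if !strings.Contains(msg, "shard unavailable") &&
			!strings.Contains(msg, "cluster in STARTING state") {
			return err
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// Toy usage: fail twice with a transient error, then succeed.
	calls := 0
	err := retryQuery(5, 200*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("executing: starting mapper: shards by node: shard unavailable")
		}
		return nil
	})
	fmt.Println(err, "after", calls, "calls")
}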
Yeah, no problem. Is it GROUP BY queries which seem to cause problems? I imagine this will be helped by reducing their memory footprint, but ultimately I think we'd like to replace memberlist with something that's a better fit for our usage.
I've got a crazy writeup of what I think causes the problem that reads like a conspiracy theory, involving Go's scheduler and GC when running highly CPU-bound code with large heaps, but...
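To make that hypothesis concrete, here's a toy experiment (not Pilosa code) showing how a periodic "heartbeat" goroutine - a stand-in for the one answering gossip probes - can wake up late once every CPU is saturated with allocation-heavy work and the GC is churning a sizable heap:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Saturate every logical CPU with busy work that also keeps the GC busy.
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			live := make([][]byte, 0, 128)
			for {
				live = append(live, make([]byte, 1<<20)) // 1 MiB allocation
				if len(live) == cap(live) {
					for j := range live {
						live[j] = nil // drop references so the GC has garbage to collect
					}
					live = live[:0]
				}
			}
		}()
	}

	// "Heartbeat": should wake every 100ms; report how late each wakeup is.
	interval := 100 * time.Millisecond
	next := time.Now().Add(interval)
	for i := 0; i < 50; i++ {
		time.Sleep(time.Until(next))
		lateBy := time.Since(next)
		if lateBy > 10*time.Millisecond {
			fmt.Printf("heartbeat %d late by %v\n", i, lateBy)
		}
		next = next.Add(interval)
	}
}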
Lately we see it mostly for Count() queries. We almost never use GroupBy at the moment, unfortunately; we simulate it with Count()s...