featurebase
Queries interfere with gossip, causing cluster to die
For bugs, please provide the following:
What's going wrong?
It seems that when we start querying the cluster, the queries contend for resources with the nodes' ability to answer gossip probes, and the cluster jumps from NORMAL to STARTING and back frequently, causing queries to error out. It is not caused by the network being a bottleneck: traffic is low, around 50 KB/s, which is nothing for the available capacity.
What was expected?
Queries should not interfere with the cluster's ability to communicate. I think gossip communication and the cluster's ability to stay alive should come first when distributing available resources; it would be nice to have a handle on this.
Steps to reproduce the behavior
My gossip config is:
[gossip]
key = ""
port = "14000"
probe-interval = "4s"
probe-timeout = "2s"
push-pull-interval = "30s"
seeds = ["10.8.222.185:14000","10.8.249.29:14000"]
stream-timeout = "10s"
suspicion-mult = 4
to-the-dead-time = "30s"
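For reference, these knobs look like they correspond to hashicorp/memberlist settings (Pilosa's gossip is built on memberlist). Here is a minimal sketch of how they would be applied to a memberlist.Config, assuming a one-to-one mapping of names; the mapping is my guess, not taken from Pilosa's source:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Assumed mapping of the [gossip] section above onto memberlist's Config.
	conf := memberlist.DefaultLANConfig()
	conf.BindPort = 14000                       // port
	conf.ProbeInterval = 4 * time.Second        // probe-interval
	conf.ProbeTimeout = 2 * time.Second         // probe-timeout
	conf.PushPullInterval = 30 * time.Second    // push-pull-interval
	conf.SuspicionMult = 4                      // suspicion-mult
	conf.GossipToTheDeadTime = 30 * time.Second // to-the-dead-time
	conf.TCPTimeout = 10 * time.Second          // stream-timeout (assumed)

	list, err := memberlist.Create(conf)
	if err != nil {
		log.Fatal(err)
	}
	// seeds
	if _, err := list.Join([]string{"10.8.222.185:14000", "10.8.249.29:14000"}); err != nil {
		log.Fatal(err)
	}
}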
Included merged logs from my 7 boxes (queries started at 10:58; issues with gossip appear immediately, the cluster starts to jump back and forth between states, and this behavior persists until the query load stops):
Information about your environment (OS/architecture, CPU, RAM, cluster/solo, configuration, etc.)
commit - https://github.com/pilosa/pilosa/commit/5bcb00e11a746a988dfbca3f5b8c55d3c21fbd54
Yeah, we definitely need to figure out why this happens... it's a bit strange. You might try increasing suspicion-mult, which will give memberlist a bit more time to refute dropped nodes and hopefully avoid the cluster changing state.
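For a sense of scale: if I remember memberlist's logic right, the suspicion timeout works out to roughly suspicion-mult * max(1, log10(cluster size)) * probe-interval, so with the config above a suspected node only gets about 16s to refute before being declared dead. A small sketch of that calculation (my reconstruction, not memberlist's actual code):

package main

import (
	"fmt"
	"math"
	"time"
)

// Rough reconstruction of memberlist's suspicion timeout:
// suspicionMult * max(1, log10(n)) * probeInterval.
// Exact behavior may differ between memberlist versions.
func suspicionTimeout(suspicionMult, n int, probeInterval time.Duration) time.Duration {
	nodeScale := math.Max(1.0, math.Log10(math.Max(1.0, float64(n))))
	return time.Duration(float64(suspicionMult) * nodeScale * float64(probeInterval))
}

func main() {
	// 7 nodes, suspicion-mult = 4, probe-interval = 4s -> 16s to refute.
	fmt.Println(suspicionTimeout(4, 7, 4*time.Second))
	// Bumping suspicion-mult to 6 stretches that to 24s.
	fmt.Println(suspicionTimeout(6, 7, 4*time.Second))
}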
I'm going to hypothesize, though, that there is some kind of blocking operation happening that stops gossip from being able to share state while queries are going on. I'll look into this.
@dmibor would love to know if this issue has still been affecting you all and if the latest changes fix/improve it. It's been a fairly complex thing to track down, and it seems like there are multiple paths for causing it.
@jaffee Hey! Will definitely keep an eye on that one starting next month, when we install master again.
P.S.: sorry for the silence for a while - I was on a month-long vacation.
@jaffee I see the same behavior from time to time after installing master ( https://github.com/pilosa/pilosa/commit/a5aa6e48a50ab34cd3a1136147c12fb145f95b14 ) - a heavily loaded node loses connectivity and kills the cluster for up to 10 seconds. Hard to say whether it improved or not, given the sporadic nature of the thing... Probably improved, but not completely fixed. On top of that, in addition to the "cluster in STARTING state" error response, I have started to see this:
400 Bad Request - {"error":"executing: starting mapper: shards by node: shard unavailable"}
The query itself is fine, since a retry of the same query works with no issues.
@jaffee Can we reopen this? The issue is still there; we keep seeing it periodically. Right now we work around it by putting a ton of retries into everything that queries Pilosa frequently; it would be nice if that were no longer needed...
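For what it's worth, our workaround looks roughly like the sketch below - a hypothetical retry helper (retryQuery is my own name, not a client API) that retries only on the transient cluster-state errors quoted above:

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// retryQuery retries a query whenever the cluster reports a transient
// availability error, with simple exponential backoff.
func retryQuery(attempts int, backoff time.Duration, query func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = query(); err == nil {
			return nil
		}
		// Treat only the transient cluster-state errors as retryable.
		msg := err.Error()
		if !strings.Contains(msg, "shard unavailable") &&
			!strings.Contains(msg, "cluster in STARTING state") {
			return err
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// Toy usage: fail twice with a transient error, then succeed.
	calls := 0
	err := retryQuery(5, 200*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("executing: starting mapper: shards by node: shard unavailable")
		}
		return nil
	})
	fmt.Println(err, "after", calls, "calls")
}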
Yeah, no problem. Is it GROUP BY queries which seem to cause problems? I imagine this will be helped by reducing their memory footprint, but ultimately I think we'd like to replace memberlist with something that's a better fit for our usage.
I've got a crazy writeup of what I think causes the problem that reads like a conspiracy theory, involving Go's scheduler and GC when running highly CPU-bound code with large heaps, but...
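To make that hypothesis concrete, here's a toy experiment (not Pilosa code) showing how a periodic "heartbeat" goroutine - a stand-in for the one answering gossip probes - can wake up late once every CPU is saturated with allocation-heavy work and the GC is churning a sizable heap:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Saturate every logical CPU with busy work that also keeps the GC busy.
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			live := make([][]byte, 0, 128)
			for {
				live = append(live, make([]byte, 1<<20)) // 1 MiB allocation
				if len(live) == cap(live) {
					for j := range live {
						live[j] = nil // drop references so the GC has garbage to collect
					}
					live = live[:0]
				}
			}
		}()
	}

	// "Heartbeat": should wake every 100ms; report how late each wakeup is.
	interval := 100 * time.Millisecond
	next := time.Now().Add(interval)
	for i := 0; i < 50; i++ {
		time.Sleep(time.Until(next))
		lateBy := time.Since(next)
		if lateBy > 10*time.Millisecond {
			fmt.Printf("heartbeat %d late by %v\n", i, lateBy)
		}
		next = next.Add(interval)
	}
}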
Lately we see it mostly for Count() queries. We almost never use GroupBy at the moment, unfortunately; we simulate it with Count()s...