OOM and overall high memory churn for small query load on large index
What's going wrong?
Pilosa OOMs under a very low query load (10-15 Count() queries per second) on a large index. The index holds around 350 GB in heap, and the cluster has 600 GB of memory total (5 x 120 GB), so there should be quite a lot of room for mmapped files and unavoidable churn - but it seems to be far from enough, even when I only ask for counts, not for IDs...
Attaching heap profiles for all 5 nodes, taken right after an OOM and restart on node4: heaps.zip
P.S.: There might be an interesting quick win here - Jaeger seems to be responsible for at least 35% of all allocations. Can I disable it?
$ go tool pprof -alloc_space /tmp/node1
File: pilosa
Type: alloc_space
Time: Mar 11, 2019 at 12:17pm (+08)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top40
Showing nodes accounting for 13649.93GB, 96.10% of 14203.57GB total
Dropped 651 nodes (cum <= 71.02GB)
Showing top 40 nodes out of 95
flat flat% sum% cum cum%
2786.42GB 19.62% 19.62% 2786.42GB 19.62% github.com/uber/jaeger-client-go.(*Tracer).newSpan
1880.79GB 13.24% 32.86% 1880.79GB 13.24% bytes.makeSlice
927.85GB 6.53% 39.39% 3714.27GB 26.15% github.com/uber/jaeger-client-go.(*Tracer).startSpanWithOptions
830.48GB 5.85% 45.24% 921.97GB 6.49% github.com/pilosa/pilosa.countOpenFiles
742.93GB 5.23% 50.47% 742.95GB 5.23% github.com/uber/jaeger-client-go.(*Span).Context
741.70GB 5.22% 55.69% 4828.11GB 33.99% github.com/uber/jaeger-client-go.(*Tracer).StartSpan
556.93GB 3.92% 59.61% 556.93GB 3.92% context.WithValue
556.58GB 3.92% 63.53% 6684.11GB 47.06% github.com/pilosa/pilosa/tracing/opentracing.(*Tracer).StartSpanFromContext
484.80GB 3.41% 66.94% 484.80GB 3.41% github.com/pilosa/pilosa.(*Row).Union
416.88GB 2.94% 69.88% 416.88GB 2.94% github.com/DataDog/datadog-go/statsd.(*Client).format
382.69GB 2.69% 72.57% 382.69GB 2.69% github.com/DataDog/datadog-go/statsd.copyAndResetBuffer (inline)
372.11GB 2.62% 75.19% 372.11GB 2.62% github.com/opentracing/opentracing-go.SpanReference.Apply
332.01GB 2.34% 77.53% 1598.57GB 11.25% github.com/pilosa/pilosa.(*Row).Intersect
277.22GB 1.95% 79.48% 4377.09GB 30.82% github.com/pilosa/pilosa.(*executor).executeRowShard
260.17GB 1.83% 81.31% 260.17GB 1.83% github.com/pilosa/pilosa/roaring.intersectArrayBitmap
245.47GB 1.73% 83.04% 245.47GB 1.73% github.com/pilosa/pilosa/roaring.intersectArrayArray
171.23GB 1.21% 84.25% 171.23GB 1.21% github.com/pilosa/pilosa/roaring.(*sliceContainers).insertAt (inline)
168.03GB 1.18% 85.43% 419.44GB 2.95% github.com/pilosa/pilosa.(*executor).shardsByNode
149.72GB 1.05% 86.49% 149.72GB 1.05% github.com/pilosa/pilosa/roaring.newSliceContainers (inline)
149.07GB 1.05% 87.53% 149.07GB 1.05% github.com/pilosa/pilosa/roaring.(*sliceContainers).Iterator
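For reference, heap profiles like these can be pulled straight from the standard Go pprof endpoint that Pilosa serves on its HTTP port; the host names and the default port below are placeholders for our setup:

```
# Grab a heap profile from each node (assuming the default bind port 10101).
for n in node1 node2 node3 node4 node5; do
  curl -s "http://${n}:10101/debug/pprof/heap" > "/tmp/${n}"
done

# Then inspect allocations as in the output above:
go tool pprof -alloc_space /tmp/node1
```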
What was expected?
Pilosa should be able to handle a query load a couple of orders of magnitude higher.
Steps to reproduce the behavior
You can set tracing.sampler-type to const and tracing.sampler-param to 0: https://www.pilosa.com/docs/latest/configuration/#tracing-sampler-type
The defaults are pretty low though, so I'm not sure how helpful this will be - we may have to implement more of a "hard switch" for disabling tracing.
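For reference, setting those two options looks something like this (a sketch only - the flag and environment variable names assume the standard mapping described on that configuration page):

```
# Disable trace sampling via command-line flags (names assumed from the docs page above):
pilosa server --tracing.sampler-type=const --tracing.sampler-param=0

# or via environment variables:
PILOSA_TRACING_SAMPLER_TYPE=const PILOSA_TRACING_SAMPLER_PARAM=0 pilosa server
```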
Can you send the exact source commit hash that goes with these profiles? I like using the list command to see line-by-line allocations in the source.
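For context, that workflow looks roughly like this once a matching binary is available (the function names are just examples pulled from the top40 output above):

```
# Build pilosa at the exact commit the profiles came from, then:
go tool pprof -alloc_space ./pilosa /tmp/node1
(pprof) list newSpan          # per-line allocation breakdown for matching functions
(pprof) list countOpenFiles
```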
https://github.com/pilosa/pilosa/commit/11fe06be85b5aac2273534be9e5fe69e5171a260
This should be helped significantly by the container shrinking work - let us know.
@jaffee, @seebs - I did some tests related to this issue, https://github.com/pilosa/pilosa/issues/1899, and https://github.com/pilosa/pilosa/issues/1894. As expected, after the container format change the heap needed to hold fragment bitmap metadata decreased about 2.5x, freeing more room for Pilosa to handle a higher query load. We see a definite improvement and an expansion of what Pilosa can survive now:
To set the scene on our data parameters: we have about 10 different fields in our main index. Most of them are quite small, ranging from 2 to 500 in cardinality, but one field, "segment" - the most interesting one and the most problematic for Pilosa - is currently about 78,000 in cardinality. On top of that we have about 5,000 shards distributed over 7 boxes with 120 GB of memory each.
1 - When trying GroupBy before the change, the cluster started to OOM if we used 2 or more small fields in a single GroupBy; e.g. we couldn't even complete a GroupBy with 250 * 500 = 125,000 groups. After the container format change, GroupBy was finally able to show a lot of utility: any GroupBy that does not involve the segment field now fits into available memory and shows its value in speed - it is at least 4 times faster than querying manually group-by-group. Unfortunately, even after the change Pilosa still can't stomach any GroupBy that involves the segment field, which is not too surprising given that crossing segment (~78,000 rows) with even a 500-row field yields up to ~39 million groups. This is related to my comments in https://github.com/pilosa/pilosa/issues/1815 about the GroupBy memory footprint.
2 - When trying another workload of ours, which consists of a lot of Count(Intersect(Row(), Row(), ...)) queries for all sorts of combinations of rows in different fields, we see similar improvements.
The maximum query load Pilosa was able to handle increased almost 3x, from 10-15 to 35-40 queries per second, which was super nice to see!
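For concreteness, one such query over the HTTP API looks roughly like this (the index name, field names, row IDs, and host are made up for illustration):

```
# A single Count(Intersect(...)) combination; names and IDs are hypothetical.
curl -s -XPOST "http://node1:10101/index/myindex/query" \
     -d 'Count(Intersect(Row(country=42), Row(segment=67321)))'
```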
The caveat here is that, again, if we use the segment field it still OOMs eventually and can't sustainably handle the query load. I see memory usage pictures and OOMs like this all the time if the query load doesn't decrease and Go doesn't free and return memory:
That is where I was coming from when discussing the idea of using swap as a resiliency and throttling tool. For the sake of the test we took the same queries from the picture above and replayed them with 100 GB of swap on all boxes - the memory for this replay looks like this:
The cluster does not die, and query throughput barely changed either - without swap this test showed an average of exactly 36 queries per second, with swap it showed 32 queries per second. We will keep trying it out, but it seems to me that using swap might bring a ton of stability for Pilosa when working with huge datasets.
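For anyone who wants to try the same thing, the swap in this test is just a plain swap file on each box, set up with the standard Linux steps (the path is illustrative):

```
# Create and enable a 100 GB swap file on each box.
sudo fallocate -l 100G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # verify it is active
```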
Also, I think these issues with a high-cardinality field on a dataset with a lot of shards show how interesting the work you talked about is: a special container type for high-cardinality fields, as well as the ability to work with a fragment bitmap at a more granular level - reading only small parts/rows of it instead of the whole thing into memory. We are now at the point where we are starting to wonder whether it is worth hacking the segment field into pieces - smaller fields - so that it doesn't produce such high memory pressure...
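One way that splitting could look - purely a sketch, with a hypothetical bucket size, sub-field naming scheme, index name, and host - is simple bucketing of the segment row ID into several smaller fields:

```
# Hypothetical sketch: spread segment (~78,000 rows) across sub-fields of 10,000 rows each.
SEGMENT_ID=67321
COLUMN_ID=123456789
BUCKET=$((SEGMENT_ID / 10000))   # which sub-field to write to: segment_6
ROW=$((SEGMENT_ID % 10000))      # row id within that sub-field: 7321
curl -s -XPOST "http://node1:10101/index/myindex/query" \
     -d "Set(${COLUMN_ID}, segment_${BUCKET}=${ROW})"
```

Reads would then address the bucketed field the same way, e.g. Row(segment_6=7321) instead of Row(segment=67321).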