
OOM and overall high memory churn for small query load on large index

Open · dmibor opened this issue on Mar 11, 2019 · 4 comments

What's going wrong?

Pilosa OOMs under a very low query load (10-15 Count() queries per second) on a large index. The index holds around 350 GB in heap, and the cluster has 600 GB of memory in total (5 x 120 GB), so there should be quite a lot of room left for mmapping files and for unavoidable churn - but it seems that is far from enough, even when I ask only for counts and not for IDs.

Attaching heap profiles for all 5 nodes, taken right after the OOM and restart on node4: heaps.zip

P.S.: there might be an interesting quick win here - Jaeger seems to be responsible for at least 35% of all allocations. Can I disable it?

$ go tool pprof -alloc_space /tmp/node1
File: pilosa
Type: alloc_space
Time: Mar 11, 2019 at 12:17pm (+08)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top40
Showing nodes accounting for 13649.93GB, 96.10% of 14203.57GB total
Dropped 651 nodes (cum <= 71.02GB)
Showing top 40 nodes out of 95
      flat  flat%   sum%        cum   cum%
 2786.42GB 19.62% 19.62%  2786.42GB 19.62%  github.com/uber/jaeger-client-go.(*Tracer).newSpan
 1880.79GB 13.24% 32.86%  1880.79GB 13.24%  bytes.makeSlice
  927.85GB  6.53% 39.39%  3714.27GB 26.15%  github.com/uber/jaeger-client-go.(*Tracer).startSpanWithOptions
  830.48GB  5.85% 45.24%   921.97GB  6.49%  github.com/pilosa/pilosa.countOpenFiles
  742.93GB  5.23% 50.47%   742.95GB  5.23%  github.com/uber/jaeger-client-go.(*Span).Context
  741.70GB  5.22% 55.69%  4828.11GB 33.99%  github.com/uber/jaeger-client-go.(*Tracer).StartSpan
  556.93GB  3.92% 59.61%   556.93GB  3.92%  context.WithValue
  556.58GB  3.92% 63.53%  6684.11GB 47.06%  github.com/pilosa/pilosa/tracing/opentracing.(*Tracer).StartSpanFromContext
  484.80GB  3.41% 66.94%   484.80GB  3.41%  github.com/pilosa/pilosa.(*Row).Union
  416.88GB  2.94% 69.88%   416.88GB  2.94%  github.com/DataDog/datadog-go/statsd.(*Client).format
  382.69GB  2.69% 72.57%   382.69GB  2.69%  github.com/DataDog/datadog-go/statsd.copyAndResetBuffer (inline)
  372.11GB  2.62% 75.19%   372.11GB  2.62%  github.com/opentracing/opentracing-go.SpanReference.Apply
  332.01GB  2.34% 77.53%  1598.57GB 11.25%  github.com/pilosa/pilosa.(*Row).Intersect
  277.22GB  1.95% 79.48%  4377.09GB 30.82%  github.com/pilosa/pilosa.(*executor).executeRowShard
  260.17GB  1.83% 81.31%   260.17GB  1.83%  github.com/pilosa/pilosa/roaring.intersectArrayBitmap
  245.47GB  1.73% 83.04%   245.47GB  1.73%  github.com/pilosa/pilosa/roaring.intersectArrayArray
  171.23GB  1.21% 84.25%   171.23GB  1.21%  github.com/pilosa/pilosa/roaring.(*sliceContainers).insertAt (inline)
  168.03GB  1.18% 85.43%   419.44GB  2.95%  github.com/pilosa/pilosa.(*executor).shardsByNode
  149.72GB  1.05% 86.49%   149.72GB  1.05%  github.com/pilosa/pilosa/roaring.newSliceContainers (inline)
  149.07GB  1.05% 87.53%   149.07GB  1.05%  github.com/pilosa/pilosa/roaring.(*sliceContainers).Iterator

What was expected?

Pilosa should be able to handle a query load a couple of orders of magnitude higher.

Steps to reproduce the behavior

dmibor · Mar 11 '19 04:03

You can set tracing.sampler-type to const and tracing.sampler-param to 0; see https://www.pilosa.com/docs/latest/configuration/#tracing-sampler-type.
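If you run from a config file, that should correspond to something roughly like the snippet below (a sketch assuming the usual TOML layout; the key names are just the flags above, so double-check against the linked docs):

[tracing]
sampler-type  = "const"   # constant-decision sampler
sampler-param = 0.0       # 0 = never sample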

The defaults are pretty low though, so I'm not sure how helpful this will be - we may have to implement more of a "hard switch" for disabling tracing.

Can you send the exact source commit hash that goes with these profiles? I like using the list command to see line-by-line in the source.
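To be concrete, that's the list command inside the same interactive pprof session as the top40 output above, e.g.:

(pprof) list newSpan

It only produces useful line-level output when pprof can locate the source tree for the exact commit the binary was built from, hence the question.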

jaffee · Mar 11 '19 13:03

https://github.com/pilosa/pilosa/commit/11fe06be85b5aac2273534be9e5fe69e5171a260

dmibor · Mar 11 '19 14:03

This should be helped significantly by the container shrinking work - let us know.

jaffee · Apr 09 '19 15:04

@jaffee, @seebs - I did some tests related to this issue, https://github.com/pilosa/pilosa/issues/1899 and https://github.com/pilosa/pilosa/issues/1894. As expected, after the container format change the heap needed to hold fragment bitmap metadata decreased 2.5 times, freeing more room for Pilosa to handle a higher query load. We see a definite improvement and an expansion of what Pilosa can survive now:

To set the scene on our data parameters: we have about 10 different fields in our main index. Most of them are quite small, ranging from 2 to 500 in cardinality, but one field, "segment" - the most interesting and the most problematic one for Pilosa - is currently about 78,000 in cardinality. On top of that we have about 5,000 shards distributed over 7 boxes with 120 GB of memory each.

1 - When trying GroupBy before the change, the cluster started to OOM if we used 2 or more small fields in a single GroupBy; e.g. we couldn't even complete a GroupBy producing 250 * 500 groups. After the container format change GroupBy was finally able to show a lot of utility: any GroupBy that does not involve the segment field now fits into available memory and shows its value in speed - it is at least 4 times faster than querying manually, group by group. Unfortunately, even after the change Pilosa still can't stomach any GroupBy that involves the segment field. This relates to my comments in https://github.com/pilosa/pilosa/issues/1815 about the GroupBy memory footprint; a sketch of the kind of query I mean follows below.
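For reference, the queries I mean are roughly of this form (the field names are made up here, and the exact PQL argument syntax may differ slightly between versions):

GroupBy(Rows(country), Rows(device))     <- two small fields, ~250 * 500 groups: fits in memory after the change
GroupBy(Rows(segment), Rows(country))    <- involves the ~78,000-row segment field: still OOMs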

2 - When trying our other workload, which consists of a lot of Count(Intersect(Row(), Row(), ...)) queries over all sorts of combinations of rows in different fields, we see similar improvements; an example of the query shape is below.
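Concretely, these queries look roughly like this (field names and row IDs are made up):

Count(Intersect(Row(country=5), Row(device=2), Row(segment=12345)))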

The maximum query load Pilosa was able to handle increased almost 3 times - from 10-15 to 35-40 queries per second - which was super nice to see!

The caveat is that, again, if we use the segment field it still OOMs eventually and can't sustainably handle the query load; I see memory-usage patterns and OOMs like the attached "oom" screenshot all the time when the query load doesn't decrease and Go doesn't free and return memory.

That is where I was coming from when discussing the idea of using swap as a resiliency and throttling tool. For the sake of the test we took the same queries as in the picture above and replayed them with 100 GB of swap on all boxes - memory during this replay looks like the attached "swap" screenshot.

The cluster does not die, and query throughput almost didn't change either - without swap this test showed an average throughput of exactly 36 queries per second, and with swap it showed 32 queries per second. We will keep trying this out, but it seems to me that using swap might bring a ton of stability for Pilosa working with huge datasets.

Also, I think these issues with a high-cardinality field on a dataset with a lot of shards show how interesting the work you mentioned is: a special container type for high-cardinality fields, as well as the ability to avoid reading a whole fragment bitmap into memory and instead work with it at a more granular level, reading only small parts/rows of it. We are now at the point where we are starting to wonder whether it is worth hacking the segment field into pieces - smaller fields - so that it doesn't produce such high memory pressure; a rough sketch of that idea follows below.
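To make that last idea concrete, here is a minimal sketch (in Go, outside of any Pilosa API) of the kind of id -> (sub-field, row) mapping we have in mind; the bucket count and field naming are entirely hypothetical:

package main

import "fmt"

// Hypothetical number of smaller sub-fields to split the high-cardinality
// "segment" field into.
const segmentBuckets = 16

// segmentLocation maps a raw segment ID onto one of the smaller sub-fields
// and the row ID to use inside that sub-field.
func segmentLocation(segmentID uint64) (field string, row uint64) {
	bucket := segmentID % segmentBuckets
	return fmt.Sprintf("segment_%02d", bucket), segmentID / segmentBuckets
}

func main() {
	// A segment that would be row 77999 in one huge field becomes a much
	// smaller row ID inside one of 16 sub-fields.
	field, row := segmentLocation(77999)
	fmt.Println(field, row) // prints: segment_15 4874
}

Each sub-field would then hold only a fraction of the rows, so no single fragment carries the full cardinality of the original segment field.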

dmibor · Apr 23 '19 07:04