
Store-Gateway: Reduce generated garbage when loading chunks from bucket block


What this PR does

The motivation of this PR is to reduce the total amount of generated garbage when populating chunks from a bucket block.

The profile

Below is a screenshot of a store-gateway memory profile corresponding to the cortex-prod-01 cell.

[Screenshot: store-gateway memory profile of the cortex-prod-01 cell, 2022-10-14]

According to this, roughly 26% of the total allocated objects are generated while invoking loadChunks, and about 22% is split between the populateChunk function and loadChunks itself.

store-gateway.cortex-prod-01.mem.profile.zip

The changes

When listing populateChunk we observe the following output:

(pprof) list populateChunk
Total: 16800841
ROUTINE ======================== github.com/grafana/mimir/pkg/storegateway.populateChunk in /backend-enterprise/vendor/github.com/grafana/mimir/pkg/storegateway/bucket.go
   1884218    1898590 (flat, cum) 11.30% of Total
         .          .    792:	return ps, stats
         .          .    793:}
         .          .    794:
         .          .    795:func populateChunk(out *storepb.AggrChunk, in chunkenc.Chunk, aggrs []storepb.Aggr, save func([]byte) ([]byte, error)) error {
         .          .    796:	if in.Encoding() == chunkenc.EncXOR {
         .      14372    797:		b, err := save(in.Bytes())
         .          .    798:		if err != nil {
         .          .    799:			return err
         .          .    800:		}
   1884218    1884218    801:		out.Raw = &storepb.Chunk{Type: storepb.Chunk_XOR, Data: b}
         .          .    802:		return nil
         .          .    803:	}
         .          .    804:	return errors.Errorf("unsupported chunk encoding %d", in.Encoding())
         .          .    805:}
         .          .    806:
(pprof)

Here we can see that most of the objects allocated in this function correspond to the storepb.Chunk type.

To mitigate this, as part of this PR we've included an extra pool, shared across block readers, to recycle objects of this type.
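
For illustration, here is a minimal sync.Pool-based sketch of that idea. The actual PR uses its own pool type and wiring, and (per the discussion below) later switches to pooling whole slices; names like chunkObjPool, getPooledChunk and putPooledChunk are made up here.

package storegateway

import (
	"sync"

	"github.com/grafana/mimir/pkg/storegateway/storepb"
)

// chunkObjPool recycles storepb.Chunk objects across block readers so that
// populateChunk doesn't have to allocate a fresh one per populated chunk.
var chunkObjPool = sync.Pool{
	New: func() interface{} { return &storepb.Chunk{} },
}

func getPooledChunk() *storepb.Chunk {
	return chunkObjPool.Get().(*storepb.Chunk)
}

func putPooledChunk(c *storepb.Chunk) {
	c.Reset() // clear the chunk before handing it back to the pool
	chunkObjPool.Put(c)
}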

Much more interesting is the content of the listing for the loadChunks function:

(pprof) list loadChunks
Total: 16800841
ROUTINE ======================== github.com/grafana/mimir/pkg/storegateway.(*bucketChunkReader).loadChunks in /backend-enterprise/vendor/github.com/grafana/mimir/pkg/storegateway/bucket.go
   1792365    4389100 (flat, cum) 26.12% of Total
[...]
         .          .   2595:		cb := buf[:chunkLen]
         .          .   2596:		n, err = io.ReadFull(bufReader, cb)
         .          .   2597:		readOffset += n
         .          .   2598:		// Unexpected EOF for last chunk could be a valid case. Any other errors are definitely real.
         .          .   2599:		if err != nil && !(errors.Is(err, io.ErrUnexpectedEOF) && i == len(pIdxs)-1) {
         .          .   2600:			return errors.Wrapf(err, "read range for seq %d offset %x", seq, pIdx.offset)
         .          .   2601:		}
         .          .   2602:
         .          .   2603:		chunkDataLen, n := binary.Uvarint(cb)
         .          .   2604:		if n < 1 {
         .          .   2605:			return errors.New("reading chunk length failed")
         .          .   2606:		}
         .          .   2607:
         .          .   2608:		// Chunk length is n (number of bytes used to encode chunk data), 1 for chunk encoding and chunkDataLen for actual chunk data.
         .          .   2609:		// There is also crc32 after the chunk, but we ignore that.
         .          .   2610:		chunkLen = n + 1 + int(chunkDataLen)
         .          .   2611:		if chunkLen <= len(cb) {
   1791358    3689948   2612:			err = populateChunk(&(res[pIdx.seriesEntry].chks[pIdx.chunk]), rawChunk(cb[n:chunkLen]), aggrs, r.save)
         .          .   2613:			if err != nil {
         .          .   2614:				return errors.Wrap(err, "populate chunk")
         .          .   2615:			}
         .          .   2616:			r.stats.chunksTouched++
         .          .   2617:			r.stats.chunksTouchedSizeSum += int(chunkDataLen)
(pprof)

I don't quite understand why (maybe @bboreham could shed some light on this), but for some reason that escapes me, the rawChunk helper value is moved to the heap before being passed to populateChunk.

To fix this, we've adapted populateChunk to receive a byte slice instead, doing the cast inside the function and thus avoiding the unnecessary allocation.
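
Here is a self-contained sketch of that signature change (simplified: the aggrs parameter and the pooling described above are left out, rawChunk is a trimmed-down stand-in for the helper used in the listing, and the real PR may differ in details):

package storegateway

import (
	"github.com/pkg/errors"
	"github.com/prometheus/prometheus/tsdb/chunkenc"

	"github.com/grafana/mimir/pkg/storegateway/storepb"
)

// rawChunk is a trimmed-down stand-in for the helper shown in the listing:
// byte 0 holds the encoding, the remaining bytes hold the chunk data.
type rawChunk []byte

func (b rawChunk) Encoding() chunkenc.Encoding { return chunkenc.Encoding(b[0]) }
func (b rawChunk) Bytes() []byte               { return b[1:] }

// populateChunk now takes a plain []byte and performs the rawChunk conversion
// internally, so the value no longer has to be boxed into the chunkenc.Chunk
// interface (and moved to the heap) at every call site.
func populateChunk(out *storepb.AggrChunk, data []byte, save func([]byte) ([]byte, error)) error {
	in := rawChunk(data)
	if in.Encoding() != chunkenc.EncXOR {
		return errors.Errorf("unsupported chunk encoding %d", in.Encoding())
	}
	b, err := save(in.Bytes())
	if err != nil {
		return err
	}
	out.Raw = &storepb.Chunk{Type: storepb.Chunk_XOR, Data: b}
	return nil
}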

The results

To validate these changes we've relied on BenchmarkBlockSeries, running the following command:

go test -run=X -bench=BenchmarkBlockSeries -benchmem -count=10 -cpu=8 ./pkg/storegateway
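
For context, the old/new/delta layout below matches benchstat output, so the comparison was presumably produced by capturing the benchmark results of the old and new code into two files (old.txt and new.txt are placeholder names here) and diffing them:

go test -run=X -bench=BenchmarkBlockSeries -benchmem -count=10 -cpu=8 ./pkg/storegateway | tee old.txt
# apply the changes, re-run the same command into new.txt, then:
benchstat old.txt new.txt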

Here's the comparison between the old and the new implementation:

name                                                          old time/op    new time/op    delta
BlockSeries/concurrency:_1,_query_sharding_enabled:_false-8     4.74ms ± 2%    4.12ms ± 1%  -13.22%  (p=0.000 n=10+9)
BlockSeries/concurrency:_1,_query_sharding_enabled:_true-8      1.56ms ± 0%    1.45ms ± 1%   -6.99%  (p=0.000 n=9+9)
BlockSeries/concurrency:_2,_query_sharding_enabled:_false-8     3.03ms ± 2%    2.79ms ± 1%   -7.81%  (p=0.000 n=10+10)
BlockSeries/concurrency:_2,_query_sharding_enabled:_true-8       966µs ± 0%     914µs ± 2%   -5.43%  (p=0.000 n=9+10)
BlockSeries/concurrency:_4,_query_sharding_enabled:_false-8     1.88ms ± 3%    1.78ms ± 3%   -5.45%  (p=0.000 n=9+10)
BlockSeries/concurrency:_4,_query_sharding_enabled:_true-8       722µs ± 1%     694µs ± 1%   -3.90%  (p=0.000 n=9+9)
BlockSeries/concurrency:_8,_query_sharding_enabled:_false-8     1.89ms ± 3%    1.81ms ± 9%   -4.48%  (p=0.009 n=10+10)
BlockSeries/concurrency:_8,_query_sharding_enabled:_true-8       745µs ±10%     709µs ± 2%   -4.94%  (p=0.006 n=10+8)
BlockSeries/concurrency:_16,_query_sharding_enabled:_false-8    1.85ms ± 4%    1.77ms ± 4%   -4.06%  (p=0.003 n=9+10)
BlockSeries/concurrency:_16,_query_sharding_enabled:_true-8      710µs ± 1%     700µs ± 2%   -1.43%  (p=0.029 n=10+10)
BlockSeries/concurrency:_32,_query_sharding_enabled:_false-8    2.66ms ± 8%    2.47ms ±12%   -6.98%  (p=0.013 n=9+10)
BlockSeries/concurrency:_32,_query_sharding_enabled:_true-8      707µs ± 1%     699µs ± 3%   -1.14%  (p=0.040 n=9+9)

name                                                          old alloc/op   new alloc/op   delta
BlockSeries/concurrency:_1,_query_sharding_enabled:_false-8     4.67MB ± 5%    4.01MB ± 8%  -14.07%  (p=0.000 n=9+10)
BlockSeries/concurrency:_1,_query_sharding_enabled:_true-8      1.49MB ± 0%    1.38MB ± 1%   -7.66%  (p=0.000 n=9+10)
BlockSeries/concurrency:_2,_query_sharding_enabled:_false-8     4.90MB ± 5%    3.89MB ± 3%  -20.75%  (p=0.000 n=10+10)
BlockSeries/concurrency:_2,_query_sharding_enabled:_true-8      1.51MB ± 0%    1.42MB ± 1%   -5.77%  (p=0.000 n=10+10)
BlockSeries/concurrency:_4,_query_sharding_enabled:_false-8     4.80MB ± 6%    3.84MB ± 1%  -20.03%  (p=0.000 n=9+8)
BlockSeries/concurrency:_4,_query_sharding_enabled:_true-8      1.52MB ± 1%    1.41MB ± 1%   -7.04%  (p=0.000 n=10+10)
BlockSeries/concurrency:_8,_query_sharding_enabled:_false-8     4.98MB ± 4%    3.96MB ± 8%  -20.51%  (p=0.000 n=10+10)
BlockSeries/concurrency:_8,_query_sharding_enabled:_true-8      1.52MB ± 1%    1.39MB ± 1%   -8.08%  (p=0.000 n=10+10)
BlockSeries/concurrency:_16,_query_sharding_enabled:_false-8    4.88MB ±15%    3.94MB ± 8%  -19.43%  (p=0.000 n=10+10)
BlockSeries/concurrency:_16,_query_sharding_enabled:_true-8     1.54MB ± 1%    1.41MB ± 2%   -8.68%  (p=0.000 n=10+10)
BlockSeries/concurrency:_32,_query_sharding_enabled:_false-8    7.05MB ±16%    5.41MB ± 5%  -23.23%  (p=0.000 n=10+8)
BlockSeries/concurrency:_32,_query_sharding_enabled:_true-8     1.57MB ± 2%    1.44MB ± 1%   -8.13%  (p=0.000 n=10+9)

name                                                          old allocs/op  new allocs/op  delta
BlockSeries/concurrency:_1,_query_sharding_enabled:_false-8      29.7k ± 1%      2.2k ± 1%  -92.54%  (p=0.000 n=10+10)
BlockSeries/concurrency:_1,_query_sharding_enabled:_true-8       5.65k ± 0%     0.85k ± 0%  -84.96%  (p=0.000 n=9+10)
BlockSeries/concurrency:_2,_query_sharding_enabled:_false-8      30.3k ± 0%      2.2k ± 0%  -92.58%  (p=0.000 n=8+9)
BlockSeries/concurrency:_2,_query_sharding_enabled:_true-8       5.64k ± 0%     0.84k ± 0%  -85.04%  (p=0.000 n=10+7)
BlockSeries/concurrency:_4,_query_sharding_enabled:_false-8      30.2k ± 3%      2.2k ± 1%  -92.55%  (p=0.000 n=10+10)
BlockSeries/concurrency:_4,_query_sharding_enabled:_true-8       5.64k ± 0%     0.85k ± 0%  -85.00%  (p=0.000 n=10+7)
BlockSeries/concurrency:_8,_query_sharding_enabled:_false-8      30.5k ± 1%      2.2k ± 2%  -92.67%  (p=0.000 n=9+10)
BlockSeries/concurrency:_8,_query_sharding_enabled:_true-8       5.65k ± 0%     0.86k ± 0%  -84.84%  (p=0.000 n=10+8)
BlockSeries/concurrency:_16,_query_sharding_enabled:_false-8     30.3k ± 6%      2.2k ± 3%  -92.71%  (p=0.000 n=10+10)
BlockSeries/concurrency:_16,_query_sharding_enabled:_true-8      5.67k ± 1%     0.87k ± 0%  -84.63%  (p=0.000 n=10+8)
BlockSeries/concurrency:_32,_query_sharding_enabled:_false-8     43.8k ± 5%      2.9k ± 2%  -93.47%  (p=0.000 n=9+8)
BlockSeries/concurrency:_32,_query_sharding_enabled:_true-8      5.72k ± 1%     0.90k ± 1%  -84.19%  (p=0.000 n=9+9)

According to this, with the new implementation the total number of allocations is reduced by ~90%.

Worth mentioning that operation time seems to have worsened. However, I think it's fair to attribute this to noise, as times barely differ by ~1ms.

Which issue(s) this PR fixes or relates to

Fixes N/A

Checklist

  • [X] Tests updated
  • [ ] Documentation added
  • [X] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

ortuman avatar Oct 14 '22 11:10 ortuman

Worth mentioning that operation time seems to have worsened. However, I think it's fair to attribute this to noise, as times barely differ by ~1ms.

I think that millisecond scale is not as negligible as micro/nanosecond scale can be; can we try to understand where that increase comes from? One of the things I'd blame is the async put into the pool.

colega avatar Oct 17 '22 10:10 colega

I think that millisecond scale is not as negligible as micro/nanosecond scale can be; can we try to understand where that increase comes from? One of the things I'd blame is the async put into the pool.

Got some CPU profile from the benchmark, and it seems that a significant amount of time is now spent in the pool Put method.

This makes sense, given that in some benchmark scenarios the total number of returned chunks is around 500k. 😱

[Screenshot: CPU profile of the benchmark, 2022-10-17, showing time spent in the pool Put method]

Maybe establishing a limit on the maximum number of returned objects here is not a bad option after all. WDYT?

ortuman avatar Oct 17 '22 13:10 ortuman

In the end, I decided to adapt the chunks pool to hold slices with a capacity of 10k chunks each, rather than managing plain storepb.Chunk objects.

And now benchmark results look better (see the updated comparison in the PR description).

This drastically reduces the total number of required pool invocations, so I've also decided to remove the async method.
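
A rough sketch of that slab-style pool, again using sync.Pool and made-up names (the actual PR code and its capacity handling may differ, and the limit is later bumped to 16384 in a follow-up below):

package storegateway

import (
	"sync"

	"github.com/grafana/mimir/pkg/storegateway/storepb"
)

// chunkSliceCapacity is the per-slice chunk budget discussed above (10k at this
// point in the thread; later changed to 16384).
const chunkSliceCapacity = 10000

// chunkSlicePool recycles whole []storepb.Chunk slabs instead of individual
// Chunk objects, so one Get/Put pair covers up to chunkSliceCapacity chunks and
// the number of pool invocations drops drastically.
var chunkSlicePool = sync.Pool{
	New: func() interface{} {
		s := make([]storepb.Chunk, 0, chunkSliceCapacity)
		return &s
	},
}

func getChunkSlice() *[]storepb.Chunk {
	s := chunkSlicePool.Get().(*[]storepb.Chunk)
	*s = (*s)[:0] // reuse the backing array, start empty
	return s
}

func putChunkSlice(s *[]storepb.Chunk) {
	chunkSlicePool.Put(s)
}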

ortuman avatar Oct 18 '22 09:10 ortuman

That's an interesting approach. Where did you get the 10k number? Or is it just a "big but not excessively huge" approach?

colega avatar Oct 18 '22 10:10 colega

Or is it just a "big but not excessively huge" approach?

That's the idea. It's still an arbitrary number, though, and subject to change.

I'll add a comment to clarify. 👍

UPDATE: I changed this limit to 16384, because I ❤️ powers of two.

ortuman avatar Oct 18 '22 13:10 ortuman

The changes LGTM. I'd still like to discuss whether it makes sense to have those pools per block or whether they should be global; let's see if @pracucci & @bboreham can weigh in on that.

colega avatar Oct 20 '22 13:10 colega