Store-Gateway: Reduce generated garbage when loading chunks from bucket block
What this PR does
The motivation of this PR is to reduce the total amount of garbage generated when populating chunks from a bucket block.
The profile
Below is a screenshot of a store-gateway memory profile from the cortex-prod-01 cell.
According to this profile, roughly 26% of all allocated objects are generated while invoking `loadChunks`, and about 22% are distributed between `populateChunk` and `loadChunk` itself.
store-gateway.cortex-prod-01.mem.profile.zip
The changes
When listing `populateChunk`, we observe the following output:
(pprof) list populateChunk
Total: 16800841
ROUTINE ======================== github.com/grafana/mimir/pkg/storegateway.populateChunk in /backend-enterprise/vendor/github.com/grafana/mimir/pkg/storegateway/bucket.go
1884218 1898590 (flat, cum) 11.30% of Total
. . 792: return ps, stats
. . 793:}
. . 794:
. . 795:func populateChunk(out *storepb.AggrChunk, in chunkenc.Chunk, aggrs []storepb.Aggr, save func([]byte) ([]byte, error)) error {
. . 796: if in.Encoding() == chunkenc.EncXOR {
. 14372 797: b, err := save(in.Bytes())
. . 798: if err != nil {
. . 799: return err
. . 800: }
1884218 1884218 801: out.Raw = &storepb.Chunk{Type: storepb.Chunk_XOR, Data: b}
. . 802: return nil
. . 803: }
. . 804: return errors.Errorf("unsupported chunk encoding %d", in.Encoding())
. . 805:}
. . 806:
(pprof)
Here we can see that most of the allocated objects in this function correspond to the `storepb.Chunk` type.
To mitigate this, this PR adds an extra pool, shared across block readers, to recycle objects of this type.
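As a rough sketch of the idea (not the exact PR code; all names below are hypothetical and `chunk` merely stands in for `storepb.Chunk`), a `sync.Pool` shared across block readers can hand out reusable chunk objects instead of allocating a fresh `storepb.Chunk` for every populated chunk:

```go
package main

import "sync"

// chunk stands in for storepb.Chunk in this sketch.
type chunk struct {
	Type int32
	Data []byte
}

// chunkPool is a shared pool of *chunk values. Pooling pointers avoids an
// extra allocation when the value is stored in the interface{} used by
// sync.Pool.
var chunkPool = sync.Pool{
	New: func() interface{} { return &chunk{} },
}

// getChunk returns a zeroed chunk, reusing a pooled one when available.
func getChunk() *chunk {
	c := chunkPool.Get().(*chunk)
	*c = chunk{}
	return c
}

// putChunk drops the reference to the chunk data and returns the object to
// the pool once the caller no longer needs it.
func putChunk(c *chunk) {
	c.Data = nil
	chunkPool.Put(c)
}

func main() {
	c := getChunk()
	c.Data = []byte{0x01}
	putChunk(c)
}
```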
Much more interesting is the listing for the `loadChunks` function:
(pprof) list loadChunks
Total: 16800841
ROUTINE ======================== github.com/grafana/mimir/pkg/storegateway.(*bucketChunkReader).loadChunks in /backend-enterprise/vendor/github.com/grafana/mimir/pkg/storegateway/bucket.go
1792365 4389100 (flat, cum) 26.12% of Total
[...]
. . 2595: cb := buf[:chunkLen]
. . 2596: n, err = io.ReadFull(bufReader, cb)
. . 2597: readOffset += n
. . 2598: // Unexpected EOF for last chunk could be a valid case. Any other errors are definitely real.
. . 2599: if err != nil && !(errors.Is(err, io.ErrUnexpectedEOF) && i == len(pIdxs)-1) {
. . 2600: return errors.Wrapf(err, "read range for seq %d offset %x", seq, pIdx.offset)
. . 2601: }
. . 2602:
. . 2603: chunkDataLen, n := binary.Uvarint(cb)
. . 2604: if n < 1 {
. . 2605: return errors.New("reading chunk length failed")
. . 2606: }
. . 2607:
. . 2608: // Chunk length is n (number of bytes used to encode chunk data), 1 for chunk encoding and chunkDataLen for actual chunk data.
. . 2609: // There is also crc32 after the chunk, but we ignore that.
. . 2610: chunkLen = n + 1 + int(chunkDataLen)
. . 2611: if chunkLen <= len(cb) {
1791358 3689948 2612: err = populateChunk(&(res[pIdx.seriesEntry].chks[pIdx.chunk]), rawChunk(cb[n:chunkLen]), aggrs, r.save)
. . 2613: if err != nil {
. . 2614: return errors.Wrap(err, "populate chunk")
. . 2615: }
. . 2616: r.stats.chunksTouched++
. . 2617: r.stats.chunksTouchedSizeSum += int(chunkDataLen)
(pprof)
I don't quite understand why (maybe @bboreham could shed some light on this), but for some reason the `rawChunk` helper value escapes to the heap before being passed to `populateChunk`.
To fix this, we've adapted `populateChunk` to receive a byte slice instead, performing the conversion inside the function and thus avoiding the unnecessary allocation.
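As a minimal, self-contained sketch of the difference (illustrative only, not the PR diff): wrapping the byte slice in a named type and passing it through an interface parameter can box the value and move it to the heap, while passing the slice directly does not.

```go
package main

import "fmt"

// chunkLike mimics the chunkenc.Chunk interface used by the old signature.
type chunkLike interface {
	Bytes() []byte
}

// rawChunk mimics the helper type wrapping the raw chunk bytes.
type rawChunk []byte

func (r rawChunk) Bytes() []byte { return r }

// populateViaInterface mimics the old shape: converting rawChunk to the
// interface at the call site may box the value and send it to the heap.
func populateViaInterface(in chunkLike) int {
	return len(in.Bytes())
}

// populateViaSlice mimics the new shape: the bytes are passed directly,
// and any wrapping happens inside the function only when it is needed.
func populateViaSlice(in []byte) int {
	return len(in)
}

func main() {
	buf := []byte{0x01, 0x02, 0x03}
	fmt.Println(populateViaInterface(rawChunk(buf)))
	fmt.Println(populateViaSlice(buf))
}
```

Running `go build -gcflags=-m` on the real code is a quick way to confirm which call sites cause the value to escape.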
The results
To validate these changes we relied on `BenchmarkBlockSeries`, running the following command:
go test -run=X -bench=BenchmarkBlockSeries -benchmem -count=10 -cpu=8 ./pkg/storegateway
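For reference, the comparison below is in `benchstat` format; presumably the output of both runs was saved to files and compared with something like `benchstat old.txt new.txt`.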
Here's the comparison between the old and the new implementation:
name old time/op new time/op delta
BlockSeries/concurrency:_1,_query_sharding_enabled:_false-8 4.74ms ± 2% 4.12ms ± 1% -13.22% (p=0.000 n=10+9)
BlockSeries/concurrency:_1,_query_sharding_enabled:_true-8 1.56ms ± 0% 1.45ms ± 1% -6.99% (p=0.000 n=9+9)
BlockSeries/concurrency:_2,_query_sharding_enabled:_false-8 3.03ms ± 2% 2.79ms ± 1% -7.81% (p=0.000 n=10+10)
BlockSeries/concurrency:_2,_query_sharding_enabled:_true-8 966µs ± 0% 914µs ± 2% -5.43% (p=0.000 n=9+10)
BlockSeries/concurrency:_4,_query_sharding_enabled:_false-8 1.88ms ± 3% 1.78ms ± 3% -5.45% (p=0.000 n=9+10)
BlockSeries/concurrency:_4,_query_sharding_enabled:_true-8 722µs ± 1% 694µs ± 1% -3.90% (p=0.000 n=9+9)
BlockSeries/concurrency:_8,_query_sharding_enabled:_false-8 1.89ms ± 3% 1.81ms ± 9% -4.48% (p=0.009 n=10+10)
BlockSeries/concurrency:_8,_query_sharding_enabled:_true-8 745µs ±10% 709µs ± 2% -4.94% (p=0.006 n=10+8)
BlockSeries/concurrency:_16,_query_sharding_enabled:_false-8 1.85ms ± 4% 1.77ms ± 4% -4.06% (p=0.003 n=9+10)
BlockSeries/concurrency:_16,_query_sharding_enabled:_true-8 710µs ± 1% 700µs ± 2% -1.43% (p=0.029 n=10+10)
BlockSeries/concurrency:_32,_query_sharding_enabled:_false-8 2.66ms ± 8% 2.47ms ±12% -6.98% (p=0.013 n=9+10)
BlockSeries/concurrency:_32,_query_sharding_enabled:_true-8 707µs ± 1% 699µs ± 3% -1.14% (p=0.040 n=9+9)
name old alloc/op new alloc/op delta
BlockSeries/concurrency:_1,_query_sharding_enabled:_false-8 4.67MB ± 5% 4.01MB ± 8% -14.07% (p=0.000 n=9+10)
BlockSeries/concurrency:_1,_query_sharding_enabled:_true-8 1.49MB ± 0% 1.38MB ± 1% -7.66% (p=0.000 n=9+10)
BlockSeries/concurrency:_2,_query_sharding_enabled:_false-8 4.90MB ± 5% 3.89MB ± 3% -20.75% (p=0.000 n=10+10)
BlockSeries/concurrency:_2,_query_sharding_enabled:_true-8 1.51MB ± 0% 1.42MB ± 1% -5.77% (p=0.000 n=10+10)
BlockSeries/concurrency:_4,_query_sharding_enabled:_false-8 4.80MB ± 6% 3.84MB ± 1% -20.03% (p=0.000 n=9+8)
BlockSeries/concurrency:_4,_query_sharding_enabled:_true-8 1.52MB ± 1% 1.41MB ± 1% -7.04% (p=0.000 n=10+10)
BlockSeries/concurrency:_8,_query_sharding_enabled:_false-8 4.98MB ± 4% 3.96MB ± 8% -20.51% (p=0.000 n=10+10)
BlockSeries/concurrency:_8,_query_sharding_enabled:_true-8 1.52MB ± 1% 1.39MB ± 1% -8.08% (p=0.000 n=10+10)
BlockSeries/concurrency:_16,_query_sharding_enabled:_false-8 4.88MB ±15% 3.94MB ± 8% -19.43% (p=0.000 n=10+10)
BlockSeries/concurrency:_16,_query_sharding_enabled:_true-8 1.54MB ± 1% 1.41MB ± 2% -8.68% (p=0.000 n=10+10)
BlockSeries/concurrency:_32,_query_sharding_enabled:_false-8 7.05MB ±16% 5.41MB ± 5% -23.23% (p=0.000 n=10+8)
BlockSeries/concurrency:_32,_query_sharding_enabled:_true-8 1.57MB ± 2% 1.44MB ± 1% -8.13% (p=0.000 n=10+9)
name old allocs/op new allocs/op delta
BlockSeries/concurrency:_1,_query_sharding_enabled:_false-8 29.7k ± 1% 2.2k ± 1% -92.54% (p=0.000 n=10+10)
BlockSeries/concurrency:_1,_query_sharding_enabled:_true-8 5.65k ± 0% 0.85k ± 0% -84.96% (p=0.000 n=9+10)
BlockSeries/concurrency:_2,_query_sharding_enabled:_false-8 30.3k ± 0% 2.2k ± 0% -92.58% (p=0.000 n=8+9)
BlockSeries/concurrency:_2,_query_sharding_enabled:_true-8 5.64k ± 0% 0.84k ± 0% -85.04% (p=0.000 n=10+7)
BlockSeries/concurrency:_4,_query_sharding_enabled:_false-8 30.2k ± 3% 2.2k ± 1% -92.55% (p=0.000 n=10+10)
BlockSeries/concurrency:_4,_query_sharding_enabled:_true-8 5.64k ± 0% 0.85k ± 0% -85.00% (p=0.000 n=10+7)
BlockSeries/concurrency:_8,_query_sharding_enabled:_false-8 30.5k ± 1% 2.2k ± 2% -92.67% (p=0.000 n=9+10)
BlockSeries/concurrency:_8,_query_sharding_enabled:_true-8 5.65k ± 0% 0.86k ± 0% -84.84% (p=0.000 n=10+8)
BlockSeries/concurrency:_16,_query_sharding_enabled:_false-8 30.3k ± 6% 2.2k ± 3% -92.71% (p=0.000 n=10+10)
BlockSeries/concurrency:_16,_query_sharding_enabled:_true-8 5.67k ± 1% 0.87k ± 0% -84.63% (p=0.000 n=10+8)
BlockSeries/concurrency:_32,_query_sharding_enabled:_false-8 43.8k ± 5% 2.9k ± 2% -93.47% (p=0.000 n=9+8)
BlockSeries/concurrency:_32,_query_sharding_enabled:_true-8 5.72k ± 1% 0.90k ± 1% -84.19% (p=0.000 n=9+9)
According to this, the new implementation reduces the total number of allocations by roughly 90%.
Worth mentioning: the time spent per operation seems to have worsened slightly. However, I think it's fair to attribute this to noise, as the times differ by barely ~1ms.
Which issue(s) this PR fixes or relates to
Fixes N/A
Checklist
- [X] Tests updated
- [ ] Documentation added
- [X] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`
> Worth mentioning: the time spent per operation seems to have worsened slightly. However, I think it's fair to attribute this to noise, as the times differ by barely ~1ms.
I think the millisecond scale is not as negligible as micro/nanoseconds can be; can we try to understand where that increase comes from? One of the things I'd blame is the async put into the pool.

> I think the millisecond scale is not as negligible as micro/nanoseconds can be; can we try to understand where that increase comes from? One of the things I'd blame is the async put into the pool.
I got a CPU profile from the benchmark, and it seems that a significant amount of time is now spent in the pool's `Put` method.
This makes sense, given that in some benchmark scenarios the total number of returned chunks is around 500k. 😱
Maybe establishing some maximum limit on returned objects here is not a bad option after all. WDYT?
In the end, I decided to adapt the chunks pool to hold slices with a capacity of 10k chunks per slice, rather than managing individual `storepb.Chunk` objects.
Benchmark results now look better (see the updated comparison in the PR description).
This drastically reduces the total number of pool invocations, so I've also decided to remove the async put.
That's an interesting approach. Where did you get the 10k number? Or is it just a "big but not excessively huge" choice?
> Or is it just a "big but not excessively huge" choice?

That's the idea. It's still an arbitrary number, though, and subject to change.
I'll add a comment to clarify. 👍
UPDATE: I changed this limit to 16384, because I ❤️ powers of two.
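For illustration, here is a rough sketch of that shape (all names are hypothetical; only the 16384 per-slice capacity comes from this thread): the pool hands out whole chunk slices, callers append into them, and `Put` happens synchronously once per slice rather than once per chunk.

```go
package main

import "sync"

// chunk stands in for storepb.Chunk in this sketch.
type chunk struct {
	Type int32
	Data []byte
}

// maxChunksPerSlice is the per-slice capacity settled on in this thread.
const maxChunksPerSlice = 16384

// chunksSlicePool pools pointers to slices so that Put does not allocate a
// new slice header each time.
var chunksSlicePool = sync.Pool{
	New: func() interface{} {
		s := make([]chunk, 0, maxChunksPerSlice)
		return &s
	},
}

// getChunksSlice returns an empty slice backed by pooled capacity.
func getChunksSlice() *[]chunk {
	s := chunksSlicePool.Get().(*[]chunk)
	*s = (*s)[:0]
	return s
}

// putChunksSlice clears references to chunk data (so the GC can reclaim it)
// and returns the slice to the pool, once per slice rather than per chunk.
func putChunksSlice(s *[]chunk) {
	if cap(*s) > maxChunksPerSlice {
		return // oversized slices are dropped instead of pooled
	}
	for i := range *s {
		(*s)[i] = chunk{}
	}
	*s = (*s)[:0]
	chunksSlicePool.Put(s)
}

func main() {
	s := getChunksSlice()
	*s = append(*s, chunk{Data: []byte{0x01}})
	putChunksSlice(s)
}
```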
The changes LGTM. I'd still like to discuss whether it makes sense to have those pools per-block or whether they should be global; let's see if @pracucci & @bboreham can say something on that.