
Creates a custom memcached client to reuse memory

Open cyriltovena opened this issue 2 years ago • 23 comments

The current memcached client always throws away the buffer, which creates a lot of allocations.

It would be great to have a reusable buffer, although this might be difficult since there are a lot of cache interfaces used along the way.

cyriltovena avatar Sep 29 '21 13:09 cyriltovena

Just to clarify: by current memcached client, do you mean the github.com/bradfitz/gomemcache module or some internal interface?

ortuman avatar Oct 06 '21 07:10 ortuman

Yes, I refer to that one, but it's hidden deep down inside Mimir.

cyriltovena avatar Oct 06 '21 09:10 cyriltovena

Just took a look, and it seems like the memcached client connection already has a RW buffer which is reused on every operation. Maybe I'm overlooking something, or the buffer drop happens elsewhere?

ortuman avatar Oct 06 '21 10:10 ortuman

It's more about the item data sent back to users of the library. Take a profile of the store-gateway; it should be one of the biggest allocators. The author left a TODO, if I remember correctly.

cyriltovena avatar Oct 06 '21 11:10 cyriltovena

gotcha! 👍

ortuman avatar Oct 06 '21 11:10 ortuman

This one https://github.com/bradfitz/gomemcache/blob/a41fca850d0b6f392931a78cbae438803ea0b886/memcache/memcache.go#L509

Would be awesome to have a way to reuse them when the item is not needed anymore.

cyriltovena avatar Oct 06 '21 11:10 cyriltovena

Would be awesome to have a way to reuse them when the item is not needed anymore.

Definitely agree. I think this is a point of improvement that could have a quite positive impact.

Maybe we can simply add a Close() error method to the Item type, to be explicitly called by the invoker. When called, we put the buffer back into a shared pool. This has the downside of having to manually invoke the method after every Get or GetMulti operation, but in the worst case the GC will eventually reclaim the buffer instead of it being returned to the pool. Not too bad.
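A minimal sketch of that idea, assuming a hypothetical `valuePool` and an `Item` type that only mirrors the shape of gomemcache's `Item` (none of this is the library's actual API):

```go
package main

import (
	"fmt"
	"sync"
)

// valuePool is a hypothetical shared pool of value buffers.
var valuePool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 1024) },
}

// Item mirrors the shape of gomemcache's Item for illustration only.
type Item struct {
	Key   string
	Value []byte
}

// Close puts the item's value buffer back into the shared pool.
// The caller must not touch it.Value afterwards; in the worst case
// (Close never called) the GC simply reclaims the buffer.
func (it *Item) Close() error {
	if it.Value != nil {
		valuePool.Put(it.Value[:0])
		it.Value = nil
	}
	return nil
}

func main() {
	it := &Item{Key: "foo", Value: valuePool.Get().([]byte)}
	// ... use it.Value ...
	_ = it.Close()
	fmt.Println(it.Value == nil) // true
}
```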

WDYT?

ortuman avatar Oct 06 '21 13:10 ortuman

That's the spirit. I think the hardest part is the usage.

If you go for it, the Prometheus code base has a pool that is bucketed and should work nicely. Maybe the buckets need to be configurable.

cyriltovena avatar Oct 06 '21 16:10 cyriltovena

When you get a chance, take a look at these changes:

https://github.com/ortuman/gomemcache/commit/8bb94a59999fc52836308966af8ec058d3d706f0

Also, I've done some benchmarking and seen a slightly better result when reusing buffers. I guess the true value of this solution will be more noticeable in the long term, by minimizing the number of GC pauses.

➜  memcache git:(master) ✗ go test -bench=BenchmarkItemValuePool -benchtime=20s

goos: darwin
goarch: arm64
pkg: github.com/bradfitz/gomemcache/memcache
BenchmarkItemValuePool/benchmark-no-buffer-reuse-8         	  492558	     45478 ns/op
BenchmarkItemValuePool/benchmark-buffer-reuse-8            	  549673	     44162 ns/op
PASS
ok  	github.com/bradfitz/gomemcache/memcache	54.019s

ortuman avatar Oct 07 '21 13:10 ortuman

The benchmark is wrong; they both make the slice.

I think you should:

  • use a https://github.com/prometheus/prometheus/blob/main/pkg/pool/pool.go
  • allocate inside Get and not outside (the one allocating should be the one that knows the size)
  • the pool can be global to all clients, IMO
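The bucketed pool being suggested can be sketched as below; the real implementation lives in the Prometheus code base (pkg/pool), and this stripped-down version with hypothetical names just shows the idea of one sync.Pool per size bucket:

```go
package main

import (
	"fmt"
	"sync"
)

// BucketedPool is a minimal sketch of a bucketed byte-slice pool:
// one sync.Pool per size bucket, capacities growing by a fixed factor.
type BucketedPool struct {
	buckets []sync.Pool
	sizes   []int
}

func NewBucketedPool(minSize, maxSize, factor int) *BucketedPool {
	p := &BucketedPool{}
	for s := minSize; s <= maxSize; s *= factor {
		size := s // capture per-bucket size for the closure below
		p.sizes = append(p.sizes, size)
		p.buckets = append(p.buckets, sync.Pool{
			New: func() interface{} { return make([]byte, 0, size) },
		})
	}
	return p
}

// Get returns a zero-length slice with capacity >= sz, taken from the
// smallest fitting bucket (or freshly allocated if sz exceeds all buckets).
func (p *BucketedPool) Get(sz int) []byte {
	for i, bktSize := range p.sizes {
		if sz <= bktSize {
			return p.buckets[i].Get().([]byte)[:0]
		}
	}
	return make([]byte, 0, sz)
}

// Put returns a slice to the bucket matching its capacity, if any.
func (p *BucketedPool) Put(b []byte) {
	for i, bktSize := range p.sizes {
		if cap(b) <= bktSize {
			p.buckets[i].Put(b[:0])
			return
		}
	}
}

func main() {
	pool := NewBucketedPool(1024, 1<<20, 2)
	buf := pool.Get(4000) // served from the 4096-byte bucket
	fmt.Println(cap(buf) >= 4000) // true
	pool.Put(buf)
}
```

Making the pool a package-level global, as suggested, lets every client share the same buckets.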

cyriltovena avatar Oct 11 '21 08:10 cyriltovena

 The benchmark is wrong; they both make the slice.

The first approach does not invoke the Close method, whereas the second one does. This means no buffer objects are reused in the first benchmark case.

Anyway, I like the idea of using a bucketed buffer pool instead. This would help to reduce fragmentation, and thus reduce GC interventions.

Take a look at ortuman/gomemcache@71beaed139c365c02400465e32bf80614da6247a and tell me how you feel about it.

Also, here's a table of preliminary benchmark results:

goos: darwin
goarch: arm64
pkg: github.com/bradfitz/gomemcache/memcache
BenchmarkItemValuePool
BenchmarkItemValuePool/benchmark-no-buffer-reuse
BenchmarkItemValuePool/benchmark-no-buffer-reuse-8         	   22761	     45378 ns/op
BenchmarkItemValuePool/benchmark-buffer-reuse
BenchmarkItemValuePool/benchmark-buffer-reuse-8            	   27661	     39797 ns/op
PASS

ortuman avatar Oct 11 '21 10:10 ortuman

It's better, but I don't think you need a bytes.Buffer; you can use the slice and re-slice it to the length required.
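A sketch of that suggestion, with illustrative (not actual gomemcache) names: keep plain []byte values in the pool and re-slice them to the length each item needs, instead of wrapping pooled memory in a bytes.Buffer:

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool holds plain byte slices; no bytes.Buffer wrapper needed.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 4096) },
}

// getValueBuffer returns a pooled slice re-sliced to length n,
// allocating fresh memory only when the pooled capacity is too small.
func getValueBuffer(n int) []byte {
	b := bufPool.Get().([]byte)
	if cap(b) < n {
		return make([]byte, n)
	}
	return b[:n] // re-slice: no allocation, reuses the backing array
}

func putValueBuffer(b []byte) {
	bufPool.Put(b[:0])
}

func main() {
	v := getValueBuffer(100)
	fmt.Println(len(v) == 100 && cap(v) >= 100) // true
	putValueBuffer(v)
}
```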

cyriltovena avatar Oct 11 '21 12:10 cyriltovena

Oh! You're 100% right. Just changed it @ https://github.com/ortuman/gomemcache/commit/ab4be5bd29b01f1f7687db3b4333890a514ac6a2

ortuman avatar Oct 11 '21 14:10 ortuman

Thanks @ortuman for working on this! Working on your personal repo is fine for experimenting, but if we'll end up vendoring this in Mimir, then please fork gomemcache under the grafana organization and open a PR with your changes so we can do a proper review. Thanks!

pracucci avatar Oct 11 '21 15:10 pracucci

Sure thing! As you mentioned I just wanted to experiment on my own repo before doing a formal fork. If this goes on I'll follow your suggestion. Thanks!

ortuman avatar Oct 13 '21 07:10 ortuman

Note Mimir currently uses https://github.com/themihai/gomemcache fork, in order to add a circuit-breaker (only in query-frontend right now).

bboreham avatar Oct 27 '21 09:10 bboreham

👋 Hey, I was passing by while investigating a different issue on dev, and I noticed on a profile that a lot of time is spent in the scanGetResponseLine function, which looks unoptimised in several different ways.

The most obvious one is that we could use

n, err := fmt.Fscanf(bytes.NewReader(line), pattern, dest...)

Instead of

n, err := fmt.Sscanf(string(line), pattern, dest...)

To avoid copying every single line just to scan it.

Apart from that, I'd consider implementing a custom parser for that line format instead of relying on fmt.Sscanf, which seems to underperform here (I'm not familiar with the memcached response format, but do we really need to decode Unicode here? It seems that a lot of time is spent just checking whether the next rune is a space).
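To illustrate the custom-parser idea: a hand-rolled, ASCII-only parser for the memcached text-protocol header `VALUE <key> <flags> <bytes> [<cas unique>]\r\n` could look roughly like this (the function name and error handling are illustrative, not the library's code):

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// parseGetResponseLine splits the response header on spaces and parses
// the numeric fields with strconv, avoiding fmt.Sscanf's rune-by-rune
// scanning entirely. A trailing cas field (parts[4]) is simply ignored
// here for brevity.
func parseGetResponseLine(line []byte) (key string, flags uint32, size int, err error) {
	line = bytes.TrimSuffix(line, []byte("\r\n"))
	parts := bytes.Split(line, []byte(" "))
	if len(parts) < 4 || !bytes.Equal(parts[0], []byte("VALUE")) {
		return "", 0, -1, fmt.Errorf("memcache: unexpected line in get response: %q", line)
	}
	key = string(parts[1])
	f, err := strconv.ParseUint(string(parts[2]), 10, 32)
	if err != nil {
		return "", 0, -1, err
	}
	size, err = strconv.Atoi(string(parts[3]))
	if err != nil {
		return "", 0, -1, err
	}
	return key, uint32(f), size, nil
}

func main() {
	k, f, n, err := parseGetResponseLine([]byte("VALUE foo 0 10\r\n"))
	fmt.Println(k, f, n, err) // foo 0 10 <nil>
}
```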

colega avatar Oct 28 '21 09:10 colega

Just for reference, it would be great to have the GATS command implemented too: https://github.com/grafana/mimir/pull/718

colega avatar Jan 11 '22 08:01 colega

Some more data: I was checking continuous profiling on dev and realised that the store-gateway spends 12% of its time just parsing the memcached response strings:

[flamegraph: store-gateway CPU profile, 3 Jan 2022]

colega avatar Feb 03 '22 16:02 colega

To avoid copying every single line just to scan it.

I wanted to check the size of this problem, so I grabbed another profile.

Total: 6.89TB
ROUTINE ======================== github.com/bradfitz/gomemcache/memcache.scanGetResponseLine in bradfitz/gomemcache/memcache/memcache.go
   52.24GB   104.95GB (flat, cum)  1.49% of Total
         .          .    500:   }
         .          .    501:}
         .          .    502:
         .          .    503:// scanGetResponseLine populates it and returns the declared size of the item.
         .          .    504:// It does not read the bytes of the item.
    6.08GB     6.08GB    505:func scanGetResponseLine(line []byte, it *Item) (size int, err error) {
         .          .    506:   pattern := "VALUE %s %d %d %d\r\n"
         .          .    507:   dest := []interface{}{&it.Key, &it.Flags, &size, &it.casid}
         .          .    508:   if bytes.Count(line, space) == 3 {
         .          .    509:           pattern = "VALUE %s %d %d\r\n"
         .          .    510:           dest = dest[:3]
         .          .    511:   }
   46.16GB    98.87GB    512:   n, err := fmt.Sscanf(string(line), pattern, dest...)
         .          .    513:   if err != nil || n != len(dest) {
  512.01kB   512.01kB    514:           return -1, fmt.Errorf("memcache: unexpected line in get response: %q", line)
         .          .    515:   }
         .          .    516:   return size, nil
         .          .    517:}

So copying the line is ~0.7% of all allocations; probably not the big win.

Agreed that a custom parser can perform better.

bboreham avatar Feb 15 '22 17:02 bboreham

I coded scanGetResponseLine() using IndexByte() instead of Sscanf().

https://github.com/bradfitz/gomemcache/compare/master...bboreham:faster-scanline

name                   old time/op    new time/op    delta
ScanGetResponseLine-4    4.06µs ± 1%    0.13µs ± 4%  -96.89%  (p=0.008 n=5+5)

name                   old alloc/op   new alloc/op   delta
ScanGetResponseLine-4      128B ± 0%       24B ± 0%  -81.25%  (p=0.008 n=5+5)

name                   old allocs/op  new allocs/op  delta
ScanGetResponseLine-4      7.00 ± 0%      1.00 ± 0%  -85.71%  (p=0.008 n=5+5)

(Note I still copy the line to a string, because we need a string to put in Item.Key. I'm copying more than is strictly needed, but it simplifies the code.)

bboreham avatar Feb 16 '22 11:02 bboreham

I coded scanGetResponseLine() using IndexByte() instead of Sscanf().

Awesome!

Since we're using a fork, can you send a PR to the fork? However... The fork hasn't been updated in 4 years, so maybe it's just easier for you to bring in the changes required.

colega avatar Feb 21 '22 14:02 colega

<fx: time passes...>

I have created https://github.com/grafana/gomemcache/ with the above change.

bboreham avatar Aug 12 '22 14:08 bboreham