Improve Cache Density and Cost Estimate
Hi Authzed folks - apologies in advance for this wall of text. 🙂
I noticed a few weeks ago that the cache cost functions are not accurate if the cost is meant to represent bytes (which I believe it is). For example, the cost of a checkResultEntry is set to just 8 bytes, the size of the empty struct, but that cost doesn't include the memory pointed to by checkResultEntry.response, which can be much larger.
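To illustrate the gap, here is a minimal sketch (the entry type below is a stand-in, not the actual SpiceDB struct): a struct holding only a pointer reports 8 bytes even though it keeps a much larger allocation alive.

```go
package main

import (
	"fmt"
	"unsafe"
)

// entry is a stand-in for a cache entry that holds a pointer to a response.
type entry struct {
	response *[1024]byte
}

func main() {
	e := entry{response: new([1024]byte)}
	fmt.Println(unsafe.Sizeof(e))           // 8 on 64-bit platforms: just the pointer
	fmt.Println(unsafe.Sizeof(*e.response)) // 1024: the memory the entry keeps alive
}
```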
As I worked to improve the cache cost functions, I found a way to fit 2x more cache items into the same amount of memory: instead of caching the Go structs, cache the protobuf-marshaled bytes.
The improved cache cost functions help keep the physical memory used by the cache much closer to the configured max cost.
I'd be happy to open some PRs for these changes, but wanted to post my findings here and see which of the changes you'd like (if any).
Cache Density
I experimented with storing the marshaled bytes of protobuf messages rather than the Go objects directly.
There are two main advantages to this:
- Calculating the cost of a `[]byte` is simple. Most importantly, the cost function does not need to change as the protobuf message changes: protobuf takes care of those details.
- The cache can store more items per MB of space used. In one test (below), the cache fit 2.12x as many items per MB! However, later tests with more accurate cost functions improved cache density by a more modest 50-70%. All tests were on a single local instance of SpiceDB, so a load test at scale is warranted.
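For illustration, a minimal sketch of the marshaled-bytes approach (function names are hypothetical, not the actual SpiceDB code), assuming a ristretto cache:

```go
package cachesketch

import (
	"github.com/dgraph-io/ristretto"
	"google.golang.org/protobuf/proto"
)

// setMarshaled stores the protobuf wire bytes instead of the Go struct.
// The cost is simply the marshaled length plus the key length.
func setMarshaled(c *ristretto.Cache, key string, msg proto.Message) error {
	data, err := proto.Marshal(msg)
	if err != nil {
		return err
	}
	c.Set(key, data, int64(len(key)+len(data)))
	return nil
}

// getMarshaled unmarshals the cached bytes into out. Every hit pays for an
// Unmarshal instead of receiving a pointer to a shared struct.
func getMarshaled(c *ristretto.Cache, key string, out proto.Message) (bool, error) {
	v, ok := c.Get(key)
	if !ok {
		return false, nil
	}
	return true, proto.Unmarshal(v.([]byte), out)
}
```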
Below are the results for two tests run on a single spicedb instance serving check requests. Total profiled space is for the whole application, while cache profiled space includes just the stacks related to caching. In this test, the cost function was still poor, but it does show that using marshaled bytes significantly improves cache density.
| test | total profiled space | cache profiled space | cache calculated cost | key count | keys/ cache profiled MB |
|---|---|---|---|---|---|
| protobuf structs | 69.16 MB | 54.85 MB | 32 MB | 142,857 | 2,605 |
| marshaled []byte | 77.02 MB | 61.0 MB | 30.1 MB | 337,311 | 5,529 |
Of course, marshaling isn't free. However, the existing code already calls proto.Clone() on every cache write, and since that call would be replaced by proto.Marshal(), the relative cost may not be significant. Still, a load test to check the impact on CPU is warranted.
Cache Cost Function
Now, the long story.
Background
As stated above, the cache was using more memory than the 'max cost' setting because the cost of each cached item was being set to the size of a pointer (8 bytes) rather than the size of the memory referenced by a pointer.
The first attempt at improving the cost function made the situation better, but there was still a substantial difference between the configured cache size and the total memory used. Below are flamegraphs of in-use space for a local SpiceDB instance, taken after a 15 minute load test of check requests. Between the 1 byte and 32 MB caches, memory increased by 59 MB, 184% of the increase in cache size. Between the 32 and 64 MB caches, memory increased by 70 MB, 219% of the increase in cache size.
1 byte Cache (single instance, local)

32 MB Cache (single instance, local)

64 MB Cache (single instance, local)

Aside on Profiling
In the flamegraphs above, the in-use bytes within ristretto.(*Cache).processItems are very close to the allocated cache size. Also, the bytes allocated within caching.(*Dispatcher).DispatchCheck grow proportionally with the cache size.
Initially I thought this meant the DispatchCheck() function was responsible for leaking memory. However, I no longer think that is the case.
Heap profiles work by sampling allocations. When a sample is taken, the stack responsible for the allocation is added to the profile. So, seeing DispatchCheck() in the flamegraph doesn't mean that DispatchCheck() is responsible for keeping bytes from GC, only that it was responsible for originally allocating those bytes.
Reviewing the SpiceDB code, this makes sense: DispatchCheck() creates the object that is stored in the cache (via proto.Clone()), but it is the cache that then keeps that object from being garbage collected.
When ristretto stores an item, it allocates a wrapper struct, which explains why it is also in the profile.
Given this, the best way to measure memory used by the cache is to sum ristretto.(*Cache).processItems and proto.Clone. Doing so for the examples above gives 113MB for the 64MB cache (176% larger) and 59MB for the 32MB cache (184% larger).
Size Classes
One of the main breakthroughs I had was learning about size classes in Go. Size classes are predefined allocation sizes (8, 16, 24, 32, 48, etc.). When allocating a 'small' object, Go rounds the requested byte count up to the smallest size class that fits. This is done to make GC tracking more efficient for small objects. See the 'One more thing' section.
So, a cost function that returns only the bytes required for an object will systematically under-report the actual cost in memory!
This article indicates that append() is aware of size classes and can be used to discover them at run time. This code demonstrates it: https://go.dev/play/p/lRaSqzunZ73
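For reference, a minimal version of that demonstration (the playground code may differ):

```go
package main

import "fmt"

// sizeClass reports the capacity the runtime actually hands back for a
// []byte of length n: append rounds the allocation up to the next size class.
func sizeClass(n int) int {
	return cap(append([]byte(nil), make([]byte, n)...))
}

func main() {
	for _, n := range []int{1, 9, 17, 25, 33, 65, 100, 129} {
		fmt.Printf("need %3d bytes -> allocated %3d bytes\n", n, sizeClass(n))
	}
}
```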
After accounting for class sizes, I was able to write a cost function that exactly matched the allocated bytes, as reported by memstats.TotalAlloc.
Keys Count Too
Still, even accounting for size classes, the cost function was not controlling memory like I wanted. How could my tests show a perfect match to the reported allocated memory, but still allow the cache to grow beyond max cost? The answer is fairly simple: cache keys are stored too, and take up memory. After including keys in the cost function, I got the following results (caching []byte):
| test | total profiled space | cache profiled space | cache computed space | key count | keys/cache profiled MB |
|---|---|---|---|---|---|
| 8MB cache | 33.1 MB | 16.2 MB | 8 MB | 42,094 | 2,598 |
| 16MB cache | 40.4 MB | 24.3 MB | 16 MB | 84,097 | 3,460 |
| 32MB cache | 63.8 MB | 44.4 MB | 32 MB | 168,152 | 3,787 |
The difference in cache size between 8MB and 16MB max cost was 8.1MB! Between 16MB and 32MB, 20.1 MB, which is off by about 26%.
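Putting those two observations together, a cost function along these lines (hypothetical, not the exact code I tested) charges for the key, the value, and the allocator's rounding:

```go
// sizeClassOf rounds n up to the runtime's allocation size class using the
// append trick shown earlier. A real implementation would precompute the table
// rather than allocate on every call.
func sizeClassOf(n int) int64 {
	return int64(cap(append([]byte(nil), make([]byte, n)...)))
}

// entryCost charges the cache for both the key and the cached bytes.
func entryCost(key string, value []byte) int64 {
	return sizeClassOf(len(key)) + sizeClassOf(len(value))
}
```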
Final Cost Function (protobuf structs, not bytes)
This test was run with a cost function that accounted for keys and size classes. No changes were made to the objects stored in the cache for this test.
| test | total profiled space | cache profiled space | cache computed space | key count | keys/cache profiled MB |
|---|---|---|---|---|---|
| no cache (1 byte) | 15.6 MB | 0 MB | 0 MB | 0 | 0 |
| 16MB cache | 34.8 MB | 21.5 MB | 16 MB | 46,916 | 2,182 |
| 32MB cache | 55.2 MB | 37.8 MB | 32 MB | 93,825 | 2,482 |
This shows there is still some overhead for the cache, since going from a cache with only 1 byte max cost (effectively, no cache) to 16 MB cost added 21.5 MB to memory used by the cache. But, going from 16MB to 32MB added 16.3MB, off by ~2%.
Compared to the earlier test that used a similar cost function but stored marshaled bytes, this also shows that storing bytes is still more efficient, although less so than in the original test. That makes sense: now that the key is included in the cost function, the space saved on the values themselves is a smaller proportion of the total cost per entry.
Misc Learnings
- Are there memory leaks?
  - I don't think so. Once the cache reaches capacity and begins to evict items, memory use is stable.
- Is protobuf increasing the memory footprint?
  - The items stored in the cache are protobuf-generated types and have some fields specific to protobuf (`protoimpl.MessageState`, `protoimpl.SizeCache`, `protoimpl.UnknownFields`). It is possible these fields are getting populated after the cost function runs, increasing the memory footprint beyond what the cost function calculates. Running SpiceDB locally, I did see that this was the case: sending a message from the cache caused its size to increase significantly. However, subsequent sends shared the memory added by the first send. To further test whether the protobuf fields were increasing cost, I ran tests where the cached object was never returned to callers, only deep copies. Memory use was similar enough that I don't think the protobuf fields have a significant impact.
  - 32 MB Cache (main)
  - 32 MB Cache (clone on return)
Next Steps
As for what to do next:
- Would you welcome a PR for caching marshaled bytes? I can open one for that.
- Would you welcome a PR for an improved cache cost function? I can open one for this too, but it would differ depending on whether marshaled bytes are stored or not.
> Would you welcome a PR for caching marshaled bytes? I can open one for that.
Load testing on a "normal" machine, i.e. one with a common proportion of CPU to memory, shows that SpiceDB is CPU bound even with the estimator off by a factor of 2x. I would be loath to add any more work to the CPU than necessary. Our cache items aren't particularly long-lived either, due to the nature of the evaluation revision picking, so it's hard to argue that a significantly larger or denser cache would save an equivalent amount of CPU.
I would love to be proved wrong though.
> Would you welcome a PR for an improved cache cost function? I can open one for this too, but it would differ depending on whether marshaled bytes are stored or not.
Almost certainly! We would again need to evaluate the CPU impact of counting the bytes more accurately, but this would go a long way toward proving that SpiceDB does what it says it does regarding cache memory usage.
> Load testing on a "normal" machine, i.e. one with a common proportion of CPU to memory, shows that SpiceDB is CPU bound even with the estimator off by a factor of 2x. I would be loath to add any more work to the CPU than necessary.
I am curious how much the CPU cost increases, so I'll run a load test and share the results.
> Almost certainly! We would again need to evaluate the CPU impact of counting the bytes more accurately, but this would go a long way toward proving that SpiceDB does what it says it does regarding cache memory usage.
Sounds good! I'll run a load test for the cache cost function change only (not storing marshaled bytes) and share the results for that too.
How did the load tests go? Anything interesting to report?
Hey @jakedt! I ran them Friday but haven't analyzed the results yet. I'll post a summary here later today, and I think I'll be able to give more details in Discord.
Summary
Looking over the load tests, I don’t think that either the ‘accurate cost function’ or ‘marshaling cache’ changes make sense to implement. However, it may still be good to change the cost function so that it scales with item size, even if it doesn’t report memory accurately. Or, the ‘max cost’ setting could be expressed in terms of items rather than memory use.
Note that the baseline for these tests is not the cost function currently in authzed/spicedb, which gives a cost of ‘8’ to every entry. Instead, it is one that scales with the memory used by each cache item, but doesn’t try to accurately report the physical memory used per item. Using the existing cost function would have caused our pods to hit their memory limits, even with 192MB caches.
The ‘new cost function’ is one that does try to report physical memory accurately, taking into account size classes and key costs.
The ‘marshaling’ tests also use the accurate cost function. In addition, the cached items are the marshaled protobuf messages rather than the Go structs.
Results
Against the baseline, the accurate coster used 4.7% more CPU, and the marshaling cache used 5.7% more CPU.
Neither cost function changed memory use as expected across the different cache sizes, which was the main goal.
The marshaling cache did improve density: 1.26 million keys/GB vs 0.86 million keys/GB (memory for whole app, not just cache).
However, and perhaps most importantly, p99 request latency was 50-60% higher with both the accurate cost function and the marshaling cache.
Next
Like I said before, I should be able to share more details on the tests privately. But for now, I think the only change which might make sense is to either have a cost function that scales with item size (and a disclaimer that the max cost settings are not accurate and more/less memory may be used), or to set max cost in terms of key count rather than memory.
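If the second option were pursued, sizing by item count maps naturally onto ristretto's cost model. A hypothetical sketch (maxItems is an assumed setting, not an existing SpiceDB flag):

```go
// newItemCountCache sizes the cache by item count instead of bytes, by giving
// every entry a cost of 1 and setting MaxCost to the desired number of items.
func newItemCountCache(maxItems int64) (*ristretto.Cache, error) {
	return ristretto.NewCache(&ristretto.Config{
		NumCounters: maxItems * 10, // ristretto suggests ~10x the expected item count
		MaxCost:     maxItems,      // interpreted here as "number of items"
		BufferItems: 64,
	})
}

// Every entry is then stored with a cost of 1:
//   cache.Set(key, value, 1)
```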
Maybe we could move the accurate coster into a goroutine decoupled from the serving process. It shouldn't affect latency on the read path.
That's a good idea! I'll give that a try and see how it goes. But still, while the accurate costers controlled memory use correctly for a single local instance, they didn't control memory use like I would have expected in the load tests. That could be a symptom of uneven request distribution, or maybe something else I am missing though.
Calculating the cost and inserting on a goroutine helped a bit, but read latency was still 47% slower compared to baseline. I did confirm that the buffered channel was big enough to not block senders during the test.
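For reference, a minimal sketch of that decoupling (hypothetical names; not the exact code used in the test):

```go
package asyncsketch

import "github.com/dgraph-io/ristretto"

// cacheWrite is a pending insertion.
type cacheWrite struct {
	key   string
	value []byte
}

// asyncCacheWriter computes costs and inserts into the cache on a separate
// goroutine so the request path never pays for the cost calculation.
type asyncCacheWriter struct {
	cache *ristretto.Cache
	ch    chan cacheWrite
}

func newAsyncCacheWriter(cache *ristretto.Cache, buffer int) *asyncCacheWriter {
	w := &asyncCacheWriter{cache: cache, ch: make(chan cacheWrite, buffer)}
	go w.run()
	return w
}

func (w *asyncCacheWriter) run() {
	for wr := range w.ch {
		// The (relatively expensive) cost calculation happens off the read path.
		// A crude stand-in cost: key plus value, rounded up to a size class.
		cost := int64(cap(append([]byte(nil), make([]byte, len(wr.key)+len(wr.value))...)))
		w.cache.Set(wr.key, wr.value, cost)
	}
}

// enqueue drops the write if the buffer is full rather than blocking the caller.
func (w *asyncCacheWriter) enqueue(key string, value []byte) {
	select {
	case w.ch <- cacheWrite{key: key, value: value}:
	default:
	}
}
```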
It is interesting to me that the cache cost seems to have such a large impact on latency, even though benchmarking the cost function indicates that it is fairly cheap. The number in each benchmark is the number of relation references for each slice in the metadata struct.
I don't know how many relation references are present when the load test runs, but I imagine it'd be on the lower end. And even if it were 10k, 0.093 ms is nowhere close to the added latency.
```
Benchmark_CheckResultCost/0-10        225625952      5.291 ns/op    0 B/op    0 allocs/op
Benchmark_CheckResultCost/10-10        10572729      112.6 ns/op    0 B/op    0 allocs/op
Benchmark_CheckResultCost/100-10        1237676      970.9 ns/op    0 B/op    0 allocs/op
Benchmark_CheckResultCost/1000-10        127348       9401 ns/op    0 B/op    0 allocs/op
Benchmark_CheckResultCost/10000-10        12862      93854 ns/op    0 B/op    0 allocs/op
```
Some of the work we did recently to revamp how we do caching cost should have addressed this. Feel free to reopen if this is still an issue!