fix(cmSketch)
Fixes https://github.com/dgraph-io/ristretto/issues/108.
This change contains one fix and one improvement.
Fix:
The fix addresses the initial concern raised in https://github.com/dgraph-io/ristretto/issues/108: a prior commit removed the influence of the row index i
on the index used to access the counter field. So while there were four
rows, the results were as if there were just one row, and the benefit of the three
extra rows in the sketch was lost. This commit fixes that by once again using a different part
of the spread key for each row. Now the minimum
computed by the Estimate function really has four distinct
counter values to work with, rather than four copies of the same value.
Improvement:
A few years ago, in the discussion in https://github.com/dgraph-io/ristretto/issues/108, I ended up getting xor and shift operations from the Caffeine author and the hash-prospector tool, so that indexes into the cmSketch rows would be distributed much more uniformly, even when the keys being given had low entropy.
In further testing, this spread the indexes being used much more
evenly, meaning the counter value had to be clipped at the max value
less often, and this should logically make the value
returned by Estimate more accurate.
I used the hash-prospector tool for 64-bit hashes to come up with the order of xor and shift operations and the constants to use, but given the size of the search space, of course could not get an exhaustive answer. In the years since, the tool has improved; I see a lot of progress has been made for 32-bit hashes. Someone is welcome to look for new constants for an even better result, or even just to document where the constants come from better than I can now.
I don't understand the CI Send Coverage failure.
I guess there is no one there. Am closing.
But for anyone wanting to pick up this library, know there is still a bug in the cmSketch as I've tried to outline a couple times now.
@FrankReh apologies for delays from our end ... I was looking into your CI failures. Could you please rebase your branch and pull the latest changes?
There is a particular fix we made for the CI issues: https://github.com/dgraph-io/ristretto/pull/330 (this one); if you rebase, it should take effect.
Let me know if you have any questions.
@FrankReh thanks for reopening the PR & rebasing it, I am approving a CI run on this.
Okay - all checks are passing. We will now take a look at your PR and respond back - please be a little patient with us. I am going to add a few folks here.
Hey. No problem. Thanks for getting the CI issue fixed.
It's been a while since I first found this problem - years ago. And then wanted to raise it again last year.
If someone looking at it wants a different way to approach it, they can take their own branch, change the four rows in the sketch to one, and they'll see they get exactly the same results. The sketch has lost its function because of the xor change that went in a few years ago. If you get that far, you may want to change it some other way; there are lots of ways of making it functional again.
Thanks.
Another optimization to the sketch could be cache-line fit. This may decrease hit ratio results slightly, around 1% in my Theine benchmark case, but it will improve sketch performance when the sketch size is very large: 3x faster in my benchmark case.
Here is a modified BenchmarkSketchEstimate case based on my Theine version, which uses different sizes and introduces some randomness into the keys:
// Lives in sketch_test.go; needs "math/rand", "strconv", and an xxhash
// package such as "github.com/cespare/xxhash/v2" in the imports.
func BenchmarkSketchEstimate(b *testing.B) {
	for _, size := range []int{10, 100, 1000, 10000, 100000, 1000000, 10000000, 50000000} {
		b.Run(strconv.Itoa(size), func(b *testing.B) {
			s := newCmSketch(int64(size))
			nums := []uint64{}
			for i := 0; i < 100000; i++ {
				h := xxhash.Sum64String(strconv.Itoa(rand.Intn(100000)))
				nums = append(nums, h)
			}
			b.ResetTimer()
			for n := 0; n < b.N; n++ {
				s.Estimate(nums[n%100000])
			}
		})
	}
}
And here are the results:
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkSketchEstimate/10-12 132780583 8.706 ns/op
BenchmarkSketchEstimate/100-12 129838420 9.102 ns/op
BenchmarkSketchEstimate/1000-12 139203577 8.629 ns/op
BenchmarkSketchEstimate/10000-12 139754286 8.552 ns/op
BenchmarkSketchEstimate/100000-12 100000000 10.33 ns/op
BenchmarkSketchEstimate/1000000-12 94156653 12.80 ns/op
BenchmarkSketchEstimate/10000000-12 22597843 56.92 ns/op
BenchmarkSketchEstimate/50000000-12 17342504 75.80 ns/op
That is interesting, but I have to say: given how long I believe this problem has existed in the Ristretto code base, and that nobody has yet found the time to investigate the claim I made back in 2019, I would rather not muddy the waters on this one particular issue. I was guilty of trying to propose a solution that probably made the issue more complicated than it should have been.
I found the problem by just reading the open source. When I stumbled on this function, cmSketch::Estimate, I wondered how the val gotten from the first row would ever be different from the val from the 2nd, 3rd, or 4th row, so I added some logic to print out when they were different. They never were. The logic was broken by the commit I've listed before: https://github.com/dgraph-io/ristretto/commit/f823dc4a5031a0404648c7e004c65d37bdf3f38d. In particular, this diff:
 // Estimate returns the value of the specified key.
 func (s *cmSketch) Estimate(hashed uint64) int64 {
-	l, r := uint32(hashed), uint32(hashed>>32)
 	min := byte(255)
 	for i := range s.rows {
-		// find the smallest counter value from all the rows
-		if v := s.rows[i].get((l + uint32(i)*r) & s.mask); v < min {
-			min = v
+		val := s.rows[i].get((hashed ^ s.seed[i]) & s.mask)
+		if val < min {
+			min = val
 		}
 	}
That changed code is still there, it's still wrong. https://github.com/dgraph-io/ristretto/blob/34fd2d33a86ef2eeebb75f3f5d78909b180d7410/sketch.go#L72-L82
The rows[i].get call takes an index and returns a value. The broken code doesn't use all the bits to determine which index to use; it just uses the same low bits (the ones that fit into s.mask) over and over, four times. This is slightly obfuscated by the fact that the low bits used get moved around by a different xor value for each row, but that doesn't change the fact that most of the input hash bits are being ignored. The original code tried to use all the hash bits in determining the index. One could argue it didn't do a great job, but at least it was an attempt.
If folks are happy with the sketch results for the last four years, they might as well change the number of rows from 4 to 1, saving time and space.
I think Caffeine's sketch (the new cache-line-fit version) is very good. For people who want to fork Ristretto and make some changes, maybe just copying Caffeine's sketch is enough. BTW @FrankReh, did you run some hit ratio benchmarks? I can't reproduce Ristretto's benchmark results in the README, and I want to know if someone else has tried to do that.
did you run some hit ratio benchmarks?
Yes, I did. But I can't speak to reproducing the README benchmark results, so I can't help with your wanting to do an apples-to-apples comparison. I figure code changes faster than READMEs.
The current implementation doesn't benefit from four rows in the sketch - might as well make it one, or fix the indexing scheme. That's the only angle I can help with here.
@vmg would probably be interested in this issue, since Vitess integrated a fork of this cache and it contains this deficiency. This sketch is not based on Caffeine's (either its original or cache-aligned version), though the former helped @FrankReh when working on this patch.
Thanks for the link to Caffeine's original @ben-manes .
For the maintainers, I think it shows that the original Caffeine logic for determining the sketch index to use for each row (item is the hash, i is the row):
https://github.com/ben-manes/caffeine/blob/a6be555f0f2a44d33a9d7e52ea923622e373ac7f/caffeine/src/main/java/com/github/benmanes/caffeine/cache/FrequencySketch.java#L190-L194
used the entire hash for each row (primarily through the multiplication operation), not just the same portion of the hash.
From that, it is easier to see how the mistake crept in. Someone could rationalize that xor-ing the row seed is more efficient than adding the seed, multiplying by the seed, and then adding the high 32 bits to the low 32 bits, before masking for the table size. I think the multiplication step in particular is what gets all the hash bits involved in determining which bits end up at the low end when masked for the table size.
I'm guilty again of overcomplicating the explanation, I think. To the maintainers: satisfy yourselves that there is a deficiency by actually checking whether the final min value is ever different from the value pulled out of the first row. You will find that they are always the same, so the three extra rows are adding nothing but taking time and space (meaning the frequency reported for a given item will be unnecessarily high). To get the benefit of four sketch rows, there are many calculations possible, each with speed and uniformity tradeoffs.
@ben-manes: Really appreciate the heads up. Will review this PR and possibly backport into Vitess this week. :bow:
Great, @vmg. I would recommend you prefer Caffeine's (either the original or the cache-aligned version); @Yiling-J has a Go version of each that he uses. @FrankReh's fixes look great; it's just that probably fewer eyes and less testing have gone into Ristretto's, leading to these mistakes.
This PR has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.