quamina icon indicating copy to clipboard operation
quamina copied to clipboard

numbits+base128 8-byte full-precision numbers

Open timbray opened this issue 1 year ago • 8 comments
trafficstars

For background, see #336

The effect is ability to match full range of float64 and more compact representation, with significant speedup.

timbray avatar Aug 18 '24 22:08 timbray

BTW there's something wrong with the benchmarks - my local benchmarks which measure matches/second rather than ns/op have slightly improved in this merge. Need to investigate.

timbray avatar Aug 18 '24 22:08 timbray

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 97.29730% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.47%. Comparing base (db62f53) to head (b3b6934). Report is 2 commits behind head on main.

Files Patch % Lines
value_matcher.go 86.66% 2 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #349      +/-   ##
==========================================
- Coverage   96.59%   96.47%   -0.13%     
==========================================
  Files          19       20       +1     
  Lines        1940     1899      -41     
==========================================
- Hits         1874     1832      -42     
- Misses         37       39       +2     
+ Partials       29       28       -1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Aug 19 '24 04:08 codecov-commenter

Thanks again @arnehormann, the variable-length numbits work perfectly. The numbits type is still long of course, but its toUTF8() method does the shortening. I've verified that for the “common” numbers, it produces much shorter slices.

We currently don't have a good benchmark to demonstrate a performance improvement, because all of our number benchmarks are exploring corner cases with weird huge/tiny/high-precision numbers, not just ordinary numbers like the kind that would actually be used. For the output of rand.Float64(), the average length is ~9.7 but those are definitely not everyday numbers.

But I'm confident that in practice this will give higher performance.

timbray avatar Aug 20 '24 00:08 timbray

Wow, I really need to update my local copy of the linter. Surprisingly, I think that lint failure is an actual bug; I will try to prove that with a unit test.

timbray avatar Aug 26 '24 23:08 timbray

... and just fyi (expectation management): this variable length scheme will only compress positive integers and power-of-two fractions. It won't do much for negative values because they had to be negated (xored with one bits) in the process to make them lexically sortable. Still worth it, it doesn't cost much time and no additional bytes at all.

arnehormann avatar Aug 27 '24 04:08 arnehormann

Haha, you're unduly pessimistic. numbits are integer-friendly. For the ints 0 to 100,000 inclusive, here are the counts of shortened-numbits lengths:

1: 1
2: 1
3: 115
4: 7590
5: 92294

None higher than 5. All <5 until you get over 1K. But you are correct, negative numbers all seem to be length 10

This is actually very good because in real life, many (most?) of the numbers that get matched are integers, not too large.

timbray avatar Aug 27 '24 17:08 timbray

as I said: only positive integers and power of two fractions. but positive values can be as little as 2 bytes 😁 edit... right, it can even be 1 byte 😃

arnehormann avatar Aug 27 '24 18:08 arnehormann

BTW on that last commit I neglected to add the most recent version of numbits, which has an optimized shortening that causes a tiny performance improvement. A little short of sleep these days…

timbray avatar Aug 27 '24 21:08 timbray