Comparison with Elias-Fano
I've been playing around with various succinct algorithms recently and when I saw this project on Twitter my immediate thought was "oh, let's see how it compares with Elias-Fano". In case you (or another reader of this issue) don't know it already: Elias-Fano is an algorithm for compressing ordered integers. It's quite well-known and typically used in search engine indexes (e.g. Lucene). There aren't many resources about it, but I've written about it here: https://t.holmium.no/dia/elias-fano/.
Note that Elias-Fano has one feature which ncrlite doesn't provide: constant-time access to the number at a specific index. With EF you can access the n-th number without having to decompress the previous numbers. This is actually what makes EF so useful. But I still think it's interesting to compare ncrlite against EF.
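For readers who haven't seen it: the core idea is to split each value into high and low bits, store the low bits verbatim, and encode the high bits as unary-coded gaps. A minimal sketch of that split (my own illustrative code, not taken from any of the implementations discussed here):

```python
def ef_encode(xs, l):
    """Sketch of the Elias-Fano split: low bits stored verbatim,
    high bits as unary-coded gaps. xs must be sorted non-negative ints."""
    high, low = [], []
    prev_high = 0
    for x in xs:
        h, lo = x >> l, x & ((1 << l) - 1)
        # one "0" per skipped high bucket, then a "1" marking this element
        high.extend([0] * (h - prev_high) + [1])
        prev_high = h
        low.extend((lo >> b) & 1 for b in reversed(range(l)))
    return high, low

high, low = ef_encode([3, 5, 9, 14], 2)
print(len(high), len(low))  # high part: n ones plus max_high zeros
```

The total size is `len(high) + len(low)` bits; constant-time access additionally needs a select structure over the high part.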
So I wrote a command-line tool which uses my implementation of Elias-Fano and reports the size: https://github.com/judofyr/zini/commit/0d27aae4b615c1aaeca9dde50e52ffab7d735982.
Here's the result:
```
Reading attic/seq/le.csv
Compressing 512652 numbers (4898608 bytes)...
The data would compress to: 806552 bytes

Reading attic/seq/primes.csv
Compressing 1000000 numbers (8245905 bytes)...
The data would compress to: 826448 bytes

Reading attic/seq/sigs.csv
Compressing 9 numbers (44 bytes)...
The data would compress to: 96 bytes
```
ncrlite is beating it for every file! That's very cool to see! I think I need to look a bit more into the algorithm of ncrlite and understand it further.
(Feel free to just close this issue. I guess this message is closer to a "Discussion", but this seemed to be the most suitable place to write it.)
Maybe those numbers could be added to the comparison table in the README. It's a valid benchmark, after all.
I'd say it's an accomplishment that would be nice to highlight in the README! These numbers also come from the baseline Elias-Fano implementation, so it should be quite a "universal" comparison.
Thanks. I didn't consider EF, but I did consider Rice coding of deltas. I'm somewhat puzzled by the numbers, because I'd expect this to be similar in performance to Rice coding of deltas. Rice coding of deltas actually beats ncrlite for random subsets. ncrlite does take advantage of some structure. E.g., {990, 991, ..., 1000} is small with ncrlite, but not with Rice coding of deltas. EF has the same weakness.
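To make that weakness concrete, here's a rough sketch (my own; the helper name is mine, with the delta scheme and parameter rule as in the Rice discussion below): a dense run costs Rice coding about as many bits as an equally sized spread-out set, even though the run carries far less information.

```python
from math import log2

def rice_bits(xs, N):
    # Rice-code the gaps d = x - prev - 1; parameter b = round(-lg(-lg(1 - k/N)))
    k = len(xs)
    b = max(0, round(-log2(-log2(1 - k / N))))
    prev, total = 0, 0
    for x in sorted(xs):
        d = x - prev - 1
        total += 1 + b + (d >> b)   # unary quotient plus b-bit remainder
        prev = x
    return total

run = list(range(990, 1001))       # a dense run of 11 values
spread = list(range(9, 1001, 99))  # 11 values spread over the same universe
print(rice_bits(run, 1001), rice_bits(spread, 1001))
```

The run comes out at roughly the same size as the spread-out set, because Rice coding only sees the gap distribution, not the run structure.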
Your numbers for EF are worse than those with Rice coding of deltas. Perhaps it's the choice of number of bits for the lower part? For Rice coding I used round(-log2(-log2(1 - k/N))) where N is the maximum original value and k is the number of elements in the set. What do you use?
(Using the notation I'm used to:) Number of bits for the lower parts is set to log2(u / n) + 1 where u is the universe (maximum value) and n is the number of elements in the set. I haven't derived it myself (just picked it from a paper) so I don't fully trust that it's the optimal value, but I did a quick test:
- For `le.csv` it gets set to 10. This leads to 806552 bytes.
- If I override it to 9 it compresses in total to 808008 bytes.
- If I override it to 11 it compresses in total to 837872 bytes.

At the very least it's a local minimum.
The overall encoding logic is quite simple: https://github.com/judofyr/zini/blob/0d27aae4b615c1aaeca9dde50e52ffab7d735982/src/EliasFano.zig#L22-L48
Ooooh, looking over the code I realize that this also computes the helper array which enables O(1) access! If I comment that out and only do EF I get the following:
```
Reading attic/seq/le.csv
Compressing 512652 numbers (4898608 bytes)...
The data would compress to: 770472 bytes

Reading attic/seq/primes.csv
Compressing 1000000 numbers (8245905 bytes)...
The data would compress to: 756104 bytes

Reading attic/seq/sigs.csv
Compressing 9 numbers (44 bytes)...
The data would compress to: 56 bytes
```
The following hacky diff:
```diff
diff --git i/src/EliasFano.zig w/src/EliasFano.zig
index cbc28b9..adafb37 100644
--- i/src/EliasFano.zig
+++ w/src/EliasFano.zig
@@ -67,7 +67,7 @@ pub fn writeTo(self: *const Self, w: anytype) !void {
     const masks = self.high_bits.masks;
     const len = (masks - 1)[0];
     try utils.writeSlice(w, (masks - 1)[0..len]);
-    try self.high_bits_select.writeTo(w);
+    // try self.high_bits_select.writeTo(w);
     try self.low_bits.writeTo(w);
 }
```
Still there is a sizeable difference. For Rice coding I get 707,236B.
```python
from math import log

with open('le.csv') as f:
    xs = list(map(int, f))

N = max(xs) + 1
k = len(xs)
p = 1 - k/N
b = round(-log(-log(p, 2), 2))
print(f'b={b} k={k} N={N}')

def rice(xs):
    prev = 0
    ds = []
    for x in sorted(xs):
        ds.append(x - prev - 1)
        prev = x
    total = 0
    for d in ds:
        total += 1 + b + (d // (2**b))
    return total

print(rice(xs)/8)
```
In EF you allow duplicate entries, which I do not allow. Is that the difference?
I added a comparison with EF and Rice in the README.
I found one bug where I used too many bits: https://github.com/judofyr/zini/commit/b343ef6396b2ac1967ae241618809870ad631030. This puts the le.csv at ~750 kB. This does appear to be correct according to the maths:
- There are n=512652 items in the dataset.
- The largest value is u=382584264.
- We use l=10 low bits.
- For the high bits the highest value is `u >> l` = 373617. Each of these (minus 1) becomes a "0" in the high-bit representation.
- In addition we need n "1" bits.

This gives us: `((n*l) + (u>>l) - 1 + n) / 8.0 = 751598.5` bytes.
And yes, this supports encoding duplicate entries. Not quite sure what the savings will be if we don't allow that. I guess we can just subtract n from the largest value u and re-do the calculation? 🤔 This gives me 751535.875 instead which isn't really that much of an improvement…
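The two byte counts can be reproduced with a small sketch of that formula (the function name is mine, not from the tool):

```python
def ef_size_bytes(n, u, l):
    # n*l low bits, plus a unary high part: n ones and (u >> l) - 1 zeros
    return (n * l + (u >> l) - 1 + n) / 8

# with duplicates allowed (u as-is) vs. the distinct-only variant (u - n)
print(ef_size_bytes(512652, 382584264, 10))           # -> 751598.5
print(ef_size_bytes(512652, 382584264 - 512652, 10))  # -> 751535.875
```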
I don't know that much about Rice coding, but I'm not too surprised that Elias-Fano is worse. Isn't the main advantage of EF that it supports constant-time access?
round(-lg(-lg(1-k/N))) does better than lg(u/n):
```
fn=le.csv k=512652 N=382584265
Rice b=9  707235.5
EF   l=10 751598.5
EF   l=9  734219.125

fn=primes k=1000000 N=15485864
Rice b=3  668791.625
EF   l=4  745983.125
EF   l=3  741966.5

fn=sigs k=9 N=2055
Rice b=7  10.625
EF   l=8  11.0
EF   l=7  10.875

fn=squares k=1000 N=998002
Rice b=9  1432.5
EF   l=10 1496.625
EF   l=9  1493.5

fn=9900 k=101 N=10001
Rice b=6  107.625
EF   l=7  110.625
EF   l=6  107.75
```
Huh, indeed! Thanks for teaching me a new trick! Do you have an explanation for why this value is good?
There is a derivation on Wikipedia, which ends up with the formula -log(2-p)/log(1-p). They use 1-p where I use p, so we'd have -log(1+p)/log(p) = -log_p(1+p). If p is close to 1, then this can be approximated by -log_p(2) = -1/lg(p) = 2^(-lg(-lg p)).
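A quick numeric sanity check of that approximation chain, plugging in the le.csv figures (my own sketch, base-2 logs throughout):

```python
from math import log2

# density parameter for le.csv: p = 1 - k/N, very close to 1
p = 1 - 512652 / 382584265

exact = -log2(1 + p) / log2(p)    # the -log_p(1 + p) form
approx = 2 ** -log2(-log2(p))     # the simplified 2^(-lg(-lg p)) form
print(exact, approx, round(log2(approx)))
```

The two values agree to within a fraction of a percent, and rounding lg of either recovers the b=9 seen for le.csv in the table above.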
Actually, when I look at my original expression (((n*l) + (u>>l) - 1 + n)) it's very easy to derive the minimum:
```
# Turn into maths:
bits(l) = (n*l) + (u/2^l) - 1 + n

# Derivative:
bits'(l) = n - log(2) * (u/2^l)

# The minimum should be at:
bits'(l) = 0
n = log(2) * (u/2^l)
n / (log(2) * u) = 1/2^l
(log(2) * u) / n = 2^l
log2(log(2) * u / n) = l
log2(log(2)) + log2(u/n) = l
```
This does indeed give us l = log2(u/n), but there's also an additive constant of log2(log(2)) (which happens to be ≈ -0.5287).
It's not obvious to me exactly how this rounding should be done. Maybe it's just easier to spend some flops on (1) calculating the actual real-valued optimum for l and then (2) trying both rounding it up and down, choosing whichever gives the lower value of bits.
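That two-step approach might look like this (my own sketch, reusing the bit-count expression from the derivation; function names are mine):

```python
from math import floor, log, log2

def ef_bits(n, u, l):
    # n*l low bits plus the unary high part: (u >> l) - 1 zeros and n ones
    return n * l + (u >> l) - 1 + n

def pick_l(n, u):
    # real-valued optimum from the derivation: log2(u/n) + log2(ln 2)
    exact = log2(u / n) + log2(log(2))
    lo = max(0, floor(exact))
    # try rounding down and up, keep whichever gives fewer bits
    return min(lo, lo + 1, key=lambda l: ef_bits(n, u, l))

print(pick_l(512652, 382584265), pick_l(1000000, 15485864))
```

For le.csv and primes this picks l=9 and l=3 respectively, matching the smaller of the two EF sizes in the table above.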