pisa
pisa copied to clipboard
v1
I've been working on starting a draft of index format specification and some code examples.
Let's discuss!
- (very early stage of) draft and lib code in
include/pisa/v1 - examples in
test/test_v1.cpp
As of the latest commit, we can compress an index (compress program) and run queries both for evaluation and benchmarks with query program. It's not yet parameterized by algorithm, so only ranked OR runs but it actually performs very well, compared to the old ranked OR. Below results for blocked posting lists with SIMDBP encoding on Robust (on my laptop).
New:
[2019-11-11 21:01:47.675] [info] Mean: 1568.39
[2019-11-11 21:01:47.675] [info] 50% quantile: 1013
[2019-11-11 21:01:47.675] [info] 90% quantile: 3199
[2019-11-11 21:01:47.675] [info] 95% quantile: 6066
Old:
[2019-11-11 21:01:53.201] [stderr] [info] Mean: 1751.2
[2019-11-11 21:01:53.201] [stderr] [info] 50% quantile: 1079
[2019-11-11 21:01:53.202] [stderr] [info] 90% quantile: 3483
[2019-11-11 21:01:53.202] [stderr] [info] 95% quantile: 6693
The more interesting algorithms aren't implemented yet, but stay tuned.
Also, the compression is concurrent and quite fast, Robust compresses in seconds.
A quick update: I implemented precomputed scores (only as floats right now, without quantization) and store them as 4-bytes each. It takes a lot of space, but the idea is to maybe quantize to one byte each score, and then have a very fast score lookup. I'm expecting it to be at least as fast. I'll also experiment later with different codecs once I have integers instead of floats.
[2019-11-11 21:47:38.576] [info] Mean: 793.68
[2019-11-11 21:47:38.576] [info] 50% quantile: 418
[2019-11-11 21:47:38.576] [info] 90% quantile: 1948
[2019-11-11 21:47:38.576] [info] 95% quantile: 4284
Quick tests on Clueweb09B show essentially the same results for BM25 ranked OR as before, while the average drops from 392.932 to 267.576 with precomputed quantized scores of length 1-byte without further compression.