pisa icon indicating copy to clipboard operation
pisa copied to clipboard

v1

Open elshize opened this issue 6 years ago • 3 comments

I've been working on starting a draft of index format specification and some code examples.

Let's discuss!

elshize avatar Oct 28 '19 18:10 elshize

As of the latest commit, we can compress an index (compress program) and run queries both for evaluation and benchmarks with query program. It's not yet parameterized by algorithm, so only ranked OR runs but it actually performs very well, compared to the old ranked OR. Below results for blocked posting lists with SIMDBP encoding on Robust (on my laptop).

New:

[2019-11-11 21:01:47.675] [info] Mean: 1568.39
[2019-11-11 21:01:47.675] [info] 50% quantile: 1013
[2019-11-11 21:01:47.675] [info] 90% quantile: 3199
[2019-11-11 21:01:47.675] [info] 95% quantile: 6066

Old:

[2019-11-11 21:01:53.201] [stderr] [info] Mean: 1751.2
[2019-11-11 21:01:53.201] [stderr] [info] 50% quantile: 1079
[2019-11-11 21:01:53.202] [stderr] [info] 90% quantile: 3483
[2019-11-11 21:01:53.202] [stderr] [info] 95% quantile: 6693

The more interesting algorithms aren't implemented yet, but stay tuned.

Also, the compression is concurrent and quite fast, Robust compresses in seconds.

elshize avatar Nov 12 '19 02:11 elshize

A quick update: I implemented precomputed scores (only as floats right now, without quantization) and store them as 4-bytes each. It takes a lot of space, but the idea is to maybe quantize to one byte each score, and then have a very fast score lookup. I'm expecting it to be at least as fast. I'll also experiment later with different codecs once I have integers instead of floats.

[2019-11-11 21:47:38.576] [info] Mean: 793.68
[2019-11-11 21:47:38.576] [info] 50% quantile: 418
[2019-11-11 21:47:38.576] [info] 90% quantile: 1948
[2019-11-11 21:47:38.576] [info] 95% quantile: 4284

elshize avatar Nov 12 '19 02:11 elshize

Quick tests on Clueweb09B show essentially the same results for BM25 ranked OR as before, while the average drops from 392.932 to 267.576 with precomputed quantized scores of length 1-byte without further compression.

elshize avatar Nov 12 '19 20:11 elshize