posting lists seem less compact than they could be
While experimenting, I found that the posting lists (.idx files) seem to get compressed rather well by zstd (176M => 87M with zstd level 1). This probably means the format could be made more compact in some way.
I took a look at the file to see if anything obvious stood out. Some things I noticed while doing so:
- there are sections of the file where, every 128 bytes, there is a sequence of 16 0xff bytes (these sequences account for about 4% of the posting list for my particular split)
- there are runs of 0x81 too, less consistent in length (sequences of more than 16 consecutive 0x81 account for ~5% of the file)
- there are sections where every other byte is a 0x00
- more rarely, there are runs of 0x55
- the byte distribution is skewed toward some values and ranges of values. Beyond the consequences of the previous points, 0x80-0x96 seem more frequent than other bytes. This is probably not actionable without doing entropy coding, which we should probably not do (with zstd --fast=1, we go from 176M to 114M, so there are already patterns zstd manages to exploit without entropy coding)
Without more knowledge of the format it's hard to know what any of this corresponds to, but understanding it may point toward size optimizations.
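In case someone wants to reproduce the run percentages above, here is a minimal sketch (plain std, not the exact tool used; the file name and thresholds are just examples) that measures how much of a file is covered by runs of a given byte:

```rust
use std::fs;

/// Percentage of the file covered by runs of `byte` of length >= `min_len`.
fn run_coverage(data: &[u8], byte: u8, min_len: usize) -> f64 {
    let mut covered = 0usize;
    let mut run = 0usize;
    for &b in data {
        if b == byte {
            run += 1;
        } else {
            if run >= min_len {
                covered += run;
            }
            run = 0;
        }
    }
    if run >= min_len {
        covered += run;
    }
    covered as f64 / data.len() as f64 * 100.0
}

fn main() -> std::io::Result<()> {
    // Example path: point this at the .idx file of a split.
    let data = fs::read("346cb77c09e04022aee6c49077dbc821.idx")?;
    for (byte, min_len) in [(0xffu8, 16), (0x81, 16), (0x55, 4)] {
        println!(
            "{:#04x} runs >= {}: {:.2}% of file",
            byte,
            min_len,
            run_coverage(&data, byte, min_len)
        );
    }
    Ok(())
}
```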
You can find a copy of the segment I analyzed here
Related issue https://github.com/quickwit-oss/tantivy/issues/1041
(we don't know yet what the problem is)
Copy-pasting the table from @trinity-1686a.
lz4 compresses only by finding duplicates (no Huffman or ANS entropy coding like zstd):
➜ blub git:(main) ✗ lz4 datasets/split/346cb77c09e04022aee6c49077dbc821.idx
Compressed filename will be: datasets/split/346cb77c09e04022aee6c49077dbc821.idx.lz4
Compressed 183824037 bytes into 147904079 ==> 80.46%
➜ blub git:(main) ✗ lz4 datasets/split/346cb77c09e04022aee6c49077dbc821.pos
Compressed filename will be: datasets/split/346cb77c09e04022aee6c49077dbc821.pos.lz4
Compressed 101561529 bytes into 66711130 ==> 65.69%
➜ blub git:(main) ✗ lz4 datasets/split/346cb77c09e04022aee6c49077dbc821.fast
Compressed filename will be: datasets/split/346cb77c09e04022aee6c49077dbc821.fast.lz4
Compressed 238098640 bytes into 171976107 ==> 72.23%
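To run the same lz4-vs-zstd comparison from Rust rather than the CLI, something like the sketch below works (assuming the lz4_flex and zstd crates as dependencies; the path is an example):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Example path: substitute the .idx/.pos/.fast file of a split.
    let data = fs::read("346cb77c09e04022aee6c49077dbc821.idx")?;

    // lz4: duplicate elimination only; zstd level 1: duplicates + entropy coding.
    let lz4 = lz4_flex::compress_prepend_size(&data);
    let zstd1 = zstd::encode_all(&data[..], 1)?;

    println!("original: {} bytes", data.len());
    println!(
        "lz4:      {} bytes ({:.2}%)",
        lz4.len(),
        100.0 * lz4.len() as f64 / data.len() as f64
    );
    println!(
        "zstd -1:  {} bytes ({:.2}%)",
        zstd1.len(),
        100.0 * zstd1.len() as f64 / data.len() as f64
    );
    Ok(())
}
```

The gap between the lz4 and zstd ratios gives a rough idea of how much of the redundancy is reachable without entropy coding.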
Some more data: percentages of 4-byte windows, scanned in 1-byte steps. Interestingly, the same pattern (more than 10%) can be observed on .idx, but not on .pos, between the github and hdfs datasets (gh is a single JSON field, while hdfs is body, timestamp, and severity_text).
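For context, a minimal sketch of what such a sliding-window scan does (the actual byte_distribution tool is not shown here; the path and the cutoff of 15 entries are assumptions):

```rust
use std::collections::HashMap;
use std::fs;

fn main() -> std::io::Result<()> {
    // Example path: point this at a .idx or .pos file.
    let data = fs::read("346cb77c09e04022aee6c49077dbc821.pos")?;

    // Count every 4-byte window, sliding by 1 byte.
    let mut counts: HashMap<[u8; 4], u64> = HashMap::new();
    for w in data.windows(4) {
        *counts.entry([w[0], w[1], w[2], w[3]]).or_insert(0) += 1;
    }

    // Report the most frequent windows as a percentage of all windows.
    let total = data.len().saturating_sub(3) as f64;
    let mut top: Vec<_> = counts.into_iter().collect();
    top.sort_by(|a, b| b.1.cmp(&a.1));
    let mut shown = 0.0;
    for (window, count) in top.iter().take(15) {
        let pct = *count as f64 / total * 100.0;
        shown += pct;
        println!("{:?}:{:.2}", window, pct);
    }
    println!("Other:{:.2}", 100.0 - shown);
    Ok(())
}
```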
➜ byte_distribution git:(main) ✗ cat 346cb77c09e04022aee6c49077dbc821.pos | byte_distribution
[128, 128, 128, 128]:5.01
[133, 133, 133, 133]:3.22
[255, 255, 255, 255]:2.68
[134, 134, 134, 134]:1.72
[170, 170, 170, 170]:1.15
[135, 135, 135, 135]:0.84
[133, 133, 128, 133]:0.82
[133, 128, 133, 133]:0.80
[73, 146, 36, 73]:0.72
[146, 36, 73, 146]:0.72
[36, 73, 146, 36]:0.72
[128, 133, 128, 133]:0.69
[133, 128, 133, 128]:0.67
[128, 133, 133, 133]:0.62
[133, 133, 133, 128]:0.62
[128, 133, 133, 128]:0.51
Other:78.48
➜ byte_distribution git:(main) ✗ cat 346cb77c09e04022aee6c49077dbc821.idx | byte_distribution
[129, 129, 129, 129]:7.21
[255, 255, 255, 255]:3.49
[85, 85, 85, 85]:1.04
[0, 0, 0, 0]:0.70
[128, 128, 128, 128]:0.62
Other:86.94
➜ hdfs git:(main) ✗ cat cd0780cd5cd24525b7eff422634a12a8.pos | byte_distribution
[204, 204, 204, 204]:1.57
[255, 255, 255, 255]:1.36
[187, 187, 187, 187]:1.14
[221, 221, 221, 221]:1.11
[238, 238, 238, 238]:0.92
[130, 147, 137, 147]:0.77
[137, 130, 147, 137]:0.60
[5, 5, 5, 5]:0.50
Other:92.03
➜ hdfs git:(main) ✗ cat cd0780cd5cd24525b7eff422634a12a8.idx | byte_distribution
[255, 255, 255, 255]:7.31
[129, 129, 129, 129]:3.64
[85, 85, 85, 85]:1.35
[17, 17, 17, 17]:1.06
[1, 1, 1, 1]:0.56
[129, 129, 129, 130]:0.53
[129, 129, 130, 130]:0.50
Other:85.05