notes icon indicating copy to clipboard operation
notes copied to clipboard

OEIS compression challenge

Open void4 opened this issue 3 years ago • 0 comments

The Hutter Prize awards prize money for improving the lossless data compression ratio of natural language text. Participants can submit a compressor and decompressor, or a compressed file and decompressor that manage to reproduce a certain part of the English Wikipedia[^1].

Instead of English text, it would be interesting to use the contents of the Online Encyclopedia of Integer Sequences database to see how the results and techniques differ there[^2].

For this benchmark I removed the comments (lines starting with #), as well as the A numbers (unique sequence IDs the database has assigned to each sequence) and trailing commas from the uncompressed version of the database.

The file can be found here: https://github.com/void4/compress-oeis (filtered.txt in OEIS-filtered.zip)

Each line in the file is a sequence of comma-separated integers (with an arbitrary number of digits and possibly negative), followed by a newline (\n) character

Here are the first three sequences:

0,1,1,1,2,1,2,1,5,2,2,1,5,1,2,1,14,1,5,1,5,2,2,1,15,2,2,5,4,1,4,1,51,1,2,1,14,1,2,2,14,1,6,1,4,2,2,1,52,2,5,1,5,1,15,2,13,2,2,1,13,1,2,4,267,1,4,1,5,1,4,1,50,1,2,3,4,1,6,1,52,15,2,1,15,1,2,1,12,1,10,1,4,2
1,2,2,1,1,2,1,2,2,1,2,2,1,1,2,1,1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,1,2,1,2,2,1,2,2,1,1,2,1,2,2,1,2,1,1,2,1,1,2,2,1,2,2,1,1,2,1,2,2,1,2,2,1,1,2,1,1,2,1,2,2,1,2,1,1,2,2,1,2,2,1,1,2,1,2,2,1,2,2,1,1,2,1,1,2,2,1,2,1,1,2,1,2,2
1,1,1,1,2,2,1,2,2,2,3,2,2,4,2,2,4,2,3,4,4,2,3,4,2,6,3,2,6,4,3,4,4,4,6,4,2,6,4,4,8,4,3,6,4,4,5,4,4,6,6,4,6,6,4,8,4,2,9,4,6,8,4,4,8,8,3,8,8,4,7,4,4,10,6,6,8,4,5,8,6,4,9,8,4,10,6,4,12,8,6,6,4,8,8,8,4,8,6,4
Compression method Size (MB)
Uncompressed 65.6
kgb-9 18.8
kgb-3 20.4
lrz 20.5
7z/xz (LZMA2) 21.7
zip 26.6
zpaq 29.5

[^1]: Either the first 100MB (enwik8 challenge) or 1GB (enwik9 challenge)

[^2]: Another interesting candidate would be a list of fundamental physical constants: https://pml.nist.gov/cuu/Constants/Table/allascii.txt

Edit: Another interesting challenge would be to compress a math proof database, like the set.mm file.

void4 avatar Jun 17 '21 17:06 void4