notes
notes copied to clipboard
OEIS compression challenge
The Hutter Prize awards prize money for improving the lossless data compression ratio of natural language text. Participants can submit a compressor and decompressor, or a compressed file and decompressor that manage to reproduce a certain part of the English Wikipedia[^1].
Instead of English text, it would be interesting to use the contents of the Online Encyclopedia of Integer Sequences database to see how the results and techniques differ there[^2].
For this benchmark I removed the comments (lines starting with #), as well as the A numbers (unique sequence IDs the database has assigned to each sequence) and trailing commas from the uncompressed version of the database.
The file can be found here: https://github.com/void4/compress-oeis (filtered.txt in OEIS-filtered.zip)
Each line in the file is a sequence of comma-separated integers (with an arbitrary number of digits and possibly negative), followed by a newline (\n) character
Here are the first three sequences:
0,1,1,1,2,1,2,1,5,2,2,1,5,1,2,1,14,1,5,1,5,2,2,1,15,2,2,5,4,1,4,1,51,1,2,1,14,1,2,2,14,1,6,1,4,2,2,1,52,2,5,1,5,1,15,2,13,2,2,1,13,1,2,4,267,1,4,1,5,1,4,1,50,1,2,3,4,1,6,1,52,15,2,1,15,1,2,1,12,1,10,1,4,2
1,2,2,1,1,2,1,2,2,1,2,2,1,1,2,1,1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,1,2,1,2,2,1,2,2,1,1,2,1,2,2,1,2,1,1,2,1,1,2,2,1,2,2,1,1,2,1,2,2,1,2,2,1,1,2,1,1,2,1,2,2,1,2,1,1,2,2,1,2,2,1,1,2,1,2,2,1,2,2,1,1,2,1,1,2,2,1,2,1,1,2,1,2,2
1,1,1,1,2,2,1,2,2,2,3,2,2,4,2,2,4,2,3,4,4,2,3,4,2,6,3,2,6,4,3,4,4,4,6,4,2,6,4,4,8,4,3,6,4,4,5,4,4,6,6,4,6,6,4,8,4,2,9,4,6,8,4,4,8,8,3,8,8,4,7,4,4,10,6,6,8,4,5,8,6,4,9,8,4,10,6,4,12,8,6,6,4,8,8,8,4,8,6,4
Compression method | Size (MB) |
---|---|
Uncompressed | 65.6 |
kgb-9 | 18.8 |
kgb-3 | 20.4 |
lrz | 20.5 |
7z/xz (LZMA2) | 21.7 |
zip | 26.6 |
zpaq | 29.5 |
[^1]: Either the first 100MB (enwik8 challenge) or 1GB (enwik9 challenge)
[^2]: Another interesting candidate would be a list of fundamental physical constants: https://pml.nist.gov/cuu/Constants/Table/allascii.txt
Edit: Another interesting challenge would be to compress a math proof database, like the set.mm file.