BlackLab
Compress the forward index
Operations that use the forward index tend to be I/O-limited, and the forward index takes up a lot of disk space. As our larger corpora grow and we add more forward indexes (PoS features and such), the disk cache becomes less and less effective. (The tokens file takes 4 bytes per corpus position; in our large modern Dutch corpus that means 18 forward indexes x 8.6 GB = 155 GB, way more than the memory available for the disk cache.)
Compressing the tokens file could help. A quick test on our large corpus shows that (g)zip-compressing the entire tokens file for the "lemma" annotation saves over 50% (from 9 GB to 4.2 GB). Annotations with only a few different values, such as "pos_finiteness" (values "finite" or "infinite"), can achieve much higher compression ratios (up to 95%). Of course, we would have to take measures to make sure random access speeds remain reasonable (like compressing the file in blocks), which will lower compression efficiency.
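To illustrate the block-based idea, here is a minimal sketch (class name and block size are hypothetical, not part of BlackLab) that compresses the tokens array in fixed-size blocks with java.util.zip, so reading one token only requires inflating the block that contains it:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Sketch: tokens compressed in fixed-size blocks, decompressed per block on read. */
public class BlockCompressedTokens {
    private static final int BLOCK_SIZE = 4096; // tokens per block (arbitrary example value)

    private final List<byte[]> blocks = new ArrayList<>(); // one compressed chunk per block

    public BlockCompressedTokens(int[] tokens) {
        for (int start = 0; start < tokens.length; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, tokens.length);
            ByteBuffer buf = ByteBuffer.allocate((end - start) * Integer.BYTES);
            for (int i = start; i < end; i++)
                buf.putInt(tokens[i]);
            blocks.add(deflate(buf.array()));
        }
    }

    /** Random access: only the block containing this position is decompressed. */
    public int get(int pos) {
        byte[] block = inflate(blocks.get(pos / BLOCK_SIZE));
        return ByteBuffer.wrap(block).getInt((pos % BLOCK_SIZE) * Integer.BYTES);
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished())
            out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] input) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!inflater.finished())
                out.write(buf, 0, inflater.inflate(buf));
            inflater.end();
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The block size is the knob here: larger blocks compress better but make each random access more expensive.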
This would be a trade-off between saving time on I/O and spending time on decompression, but it could be worth it, especially if multiple decompression threads can run at the same time. Saving a significant amount of disk space (and speeding up copy operations) would be a nice bonus.
Compression could use variable-length quantities (VInts), like Lucene already uses, or the more CPU-efficient Group Varint Encoding. Because a frequent term is likely to have a low term index (it will probably be encountered before infrequent terms during indexing), this would provide reasonably good compression while being quick to decode. Alternatively, gzip compression could be used, but this costs more CPU cycles; we could experiment to see if that is worth it. (Another potential disadvantage of gzip is having to compress in blocks, while VInt encoding doesn't require that.)
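As a reference, here is a minimal sketch of VInt encoding (7 data bits per byte, high bit set means "more bytes follow", as in Lucene's format; the class and method names are just for illustration). Frequent terms with low indexes end up taking a single byte:

```java
import java.io.ByteArrayOutputStream;

/** Sketch of VInt (variable-length integer) encoding/decoding. */
public class VIntExample {

    /** Append a non-negative value as a VInt: 7 bits per byte, high bit = continuation. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {         // more than 7 bits remaining
            out.write((value & 0x7F) | 0x80);  // write 7 bits, set continuation bit
            value >>>= 7;
        }
        out.write(value);                      // final byte, continuation bit clear
    }

    /** Read a VInt starting at data[pos] and return the decoded value. */
    static int readVInt(byte[] data, int pos) {
        int value = 0, shift = 0;
        byte b;
        do {
            b = data[pos++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, 5);       // frequent term index: 1 byte
        writeVInt(out, 100000);  // rare term index: 3 bytes
        byte[] data = out.toByteArray();
        System.out.println(readVInt(data, 0)); // 5
        System.out.println(readVInt(data, 1)); // 100000
    }
}
```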
If we want to stick with fixed-length numbers, it might be worth it to use only as many bits as needed. For example, if an annotation has 18 unique values, we would use 5 bits per token to store it. This means retrieving a specific token is still very fast, because we don't need to decompress an entire block, but we still save some space (an estimated 10-20%, perhaps a few hundred MB per billion tokens).
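A sketch of what such fixed-width bit packing could look like (the class is hypothetical; Lucene ships similar packed-ints utilities). Each token takes exactly bitsPerToken bits and can be read directly, without decompressing a block:

```java
/** Sketch: store each token in exactly bitsPerToken bits (e.g. 5 bits for
 *  18 unique values), packed into a long[], with direct random access.
 *  Assumes 0 <= value < 2^bitsPerToken. */
public class PackedTokens {
    private final int bitsPerToken;
    private final long[] data;

    public PackedTokens(int[] tokens, int bitsPerToken) {
        this.bitsPerToken = bitsPerToken;
        this.data = new long[(int) (((long) tokens.length * bitsPerToken + 63) / 64)];
        for (int i = 0; i < tokens.length; i++)
            set(i, tokens[i]);
    }

    private void set(int pos, int value) {
        long bitIndex = (long) pos * bitsPerToken;
        int word = (int) (bitIndex >>> 6);   // which long
        int offset = (int) (bitIndex & 63);  // bit offset within that long
        data[word] |= ((long) value) << offset;
        if (offset + bitsPerToken > 64)      // value straddles two longs
            data[word + 1] |= ((long) value) >>> (64 - offset);
    }

    public int get(int pos) {
        long bitIndex = (long) pos * bitsPerToken;
        int word = (int) (bitIndex >>> 6);
        int offset = (int) (bitIndex & 63);
        long value = data[word] >>> offset;
        if (offset + bitsPerToken > 64)      // pick up the bits from the next long
            value |= data[word + 1] << (64 - offset);
        return (int) (value & ((1L << bitsPerToken) - 1));
    }
}
```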
Another small improvement: right now we store regular two's complement signed numbers of 1, 2, 3 or 4 bytes, but the only negative number we use is -1. So we could instead store numbers in the range [-1, 2^b-2], where b is the number of bits per token. This is essentially like storing unsigned integers, with -1 represented by the MAX_VALUE for that unsigned integer. This saves about 1 bit per token, or 125 MB per billion tokens.
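A tiny sketch of that remapping, assuming b-bit storage (names are illustrative only):

```java
/** Sketch: treat the stored bits as unsigned, with the all-ones pattern
 *  (unsigned MAX_VALUE for b bits) standing in for -1. Term indexes may
 *  then use the full range [0, 2^b - 2]. */
public class SignedAsUnsigned {

    /** Encode a term index (or -1) into b bits. */
    static long encode(int value, int bits) {
        long noValue = (1L << bits) - 1;       // unsigned max == "no value" (-1)
        return value == -1 ? noValue : value;  // term indexes must be <= 2^b - 2
    }

    /** Decode b stored bits back into a term index or -1. */
    static int decode(long stored, int bits) {
        long noValue = (1L << bits) - 1;
        return stored == noValue ? -1 : (int) stored;
    }
}
```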