hadoop-lzo
Compress indexes
I added an interface for index reading/writing and provided an alternate representation of the index, which should shrink our index files by roughly 4x. I haven't tested this on real data yet, but the unit tests pass. Please comment.
TODO: make the order in which index serdes are tried configurable via properties, à la Hadoop's compression codecs, and make the writer configurable as well (right now the writer implementation is hardcoded).
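For discussion, here is a minimal sketch of how the serde order could be read from a property, in the spirit of Hadoop's comma-separated codec list. The property key `lzo.index.serdes` and the class `SerdeOrderSketch` are hypothetical and not part of this pull request; plain `java.util.Properties` stands in for Hadoop's `Configuration` to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class SerdeOrderSketch {
    // Hypothetical property key; the real name would be decided in the PR.
    static final String SERDES_KEY = "lzo.index.serdes";

    // Returns the serde class names in the order they should be tried.
    // Falls back to `defaults` when the property is not set.
    static List<String> serdeOrder(Properties props, String defaults) {
        String raw = props.getProperty(SERDES_KEY, defaults);
        List<String> names = new ArrayList<>();
        for (String part : raw.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name); // skip blanks left by stray commas
            }
        }
        return names;
    }
}
```

A reader would then walk this list and use the first serde that accepts the index file; the writer could be selected from a second property in the same way.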
Rewrote LzoTinyOffsets to use the VarInt implementation from Mahout and removed the numBlocks() method from the interface. Tests pass; still not tested on real data.
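To illustrate why a variable-length encoding shrinks the index, here is a minimal sketch that stores each block offset as a varint-encoded delta from the previous offset instead of a fixed 8-byte long. This is not the code in the pull request and not Mahout's Varint class: the class `VarIntOffsetsSketch` and the delta scheme are assumptions made for the example, and the actual LzoTinyOffsets layout may differ.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class VarIntOffsetsSketch {
    // Encodes non-decreasing offsets as unsigned varint deltas:
    // 7 payload bits per byte, high bit set when more bytes follow.
    static byte[] encode(long[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0L;
        for (long offset : offsets) {
            if (offset < previous) {
                throw new IllegalArgumentException("offsets must be non-negative and non-decreasing");
            }
            long delta = offset - previous;
            while ((delta & ~0x7FL) != 0L) {
                out.write((int) ((delta & 0x7FL) | 0x80L));
                delta >>>= 7;
            }
            out.write((int) delta);
            previous = offset;
        }
        return out.toByteArray();
    }

    // Decodes until the input is exhausted, so no block count is stored.
    static long[] decode(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long previous = 0L;
        int pos = 0;
        while (pos < data.length) {
            long delta = 0L;
            int shift = 0;
            while (true) {
                if (pos >= data.length) {
                    throw new IllegalArgumentException("truncated varint");
                }
                int b = data[pos++] & 0xFF;
                delta |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    break; // last byte of this varint
                }
                shift += 7;
                if (shift > 63) {
                    throw new IllegalArgumentException("varint too long");
                }
            }
            previous += delta;
            offsets.add(previous);
        }
        long[] result = new long[offsets.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = offsets.get(i);
        }
        return result;
    }
}
```

Because consecutive block offsets are close together, each delta fits in a few bytes rather than eight, which is where the size reduction would come from under this scheme.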
@sjlee check out this ancient pull request. The goal here is to make lzo indexes significantly smaller, making split calculation, etc, much faster. It's meant to be backwards-compatible (new hadoop-lzo can read both new and old indexes; old hadoop-lzo can't read new indexes of course). Also introduces versioning, in case we want to mess with this further.
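As a rough illustration of the backwards-compatibility idea, a reader could peek at the first bytes of an index file and fall back to the legacy layout when no version header is present. The magic number, header layout, and class name below are hypothetical and are not taken from this pull request; the sketch also assumes legacy indexes are a plain sequence of big-endian 8-byte offsets, whose leading bytes would not plausibly match the magic value.

```java
import java.nio.ByteBuffer;

final class IndexVersionSketch {
    // Hypothetical magic number marking a versioned index file.
    static final int MAGIC = 0x4C5A4958;
    // Version reported for files that carry no header at all.
    static final int LEGACY_VERSION = 0;

    // Builds an 8-byte header: magic followed by the format version.
    static byte[] header(int version) {
        return ByteBuffer.allocate(8).putInt(MAGIC).putInt(version).array();
    }

    // Returns the version from the header, or LEGACY_VERSION when the
    // file does not start with the magic number.
    static int detectVersion(byte[] firstBytes) {
        if (firstBytes.length >= 8) {
            ByteBuffer buf = ByteBuffer.wrap(firstBytes); // big-endian by default
            if (buf.getInt() == MAGIC) {
                return buf.getInt();
            }
        }
        return LEGACY_VERSION;
    }
}
```

With a check like this, a new reader handles both formats, while an old reader that only knows the legacy layout would misread a versioned file, matching the compatibility described above.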
If this is interesting, I can take a pass at making this mergeable with current master.
It does sound interesting. Could you give it a shot and let me know? Thanks.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.