hadoop-lzo

Compress indexes

Open dvryaboy opened this issue 13 years ago • 4 comments

I added an interface for index reading and writing, plus an alternate representation of the index that should shrink our index files by about 4x. I haven't tested it on real data, but unit tests pass. Please comment.

TODO: make the order in which index serdes are tried configurable via properties, à la Hadoop's compression codecs, and make the writer configurable as well (right now the writer implementation is hardcoded).

dvryaboy avatar Feb 12 '12 00:02 dvryaboy

Rewrote LzoTinyOffsets to use the VarInt implementation from Mahout, and got rid of the numBlocks() method in the interface. Tests pass; I still haven't tested on real data.
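
For readers unfamiliar with the idea, here is a minimal sketch of why variable-length integers shrink an offset index. It is not the code in this pull request and not Mahout's implementation: the class name `VarIntOffsetsSketch`, the leading count field, and the choice to store deltas between consecutive offsets are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative only: hypothetical names, not the PR's LzoTinyOffsets class.
final class VarIntOffsetsSketch {

    // Write a non-negative long, 7 bits per byte; a set high bit means "more bytes follow".
    static void writeVarLong(long value, ByteArrayOutputStream out) {
        while ((value & ~0x7FL) != 0L) {
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Read a value written by writeVarLong (at most 10 bytes for a 64-bit value).
    static long readVarLong(ByteBuffer in) {
        long value = 0L;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = in.get();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IllegalStateException("varint longer than 10 bytes");
    }

    // Assumed layout: entry count, then the difference between consecutive offsets.
    static byte[] encode(long[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarLong(offsets.length, out);
        long previous = 0L;
        for (long offset : offsets) {
            if (offset < previous) {
                throw new IllegalArgumentException("offsets must be ascending");
            }
            writeVarLong(offset - previous, out);
            previous = offset;
        }
        return out.toByteArray();
    }

    static long[] decode(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        long[] offsets = new long[(int) readVarLong(in)];
        long previous = 0L;
        for (int i = 0; i < offsets.length; i++) {
            previous += readVarLong(in);  // rebuild absolute offsets from deltas
            offsets[i] = previous;
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] offsets = {0L, 262144L, 524301L, 790000L};  // made-up block offsets
        byte[] compact = encode(offsets);
        System.out.println("fixed-width bytes: " + offsets.length * Long.BYTES);
        System.out.println("varint bytes:      " + compact.length);
        System.out.println("round trip ok:     " + Arrays.equals(offsets, decode(compact)));
    }
}
```

The saving comes from block-to-block deltas being small compared with a full 8-byte offset; the actual ratio on real index files would depend on block sizes and is untested here.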

dvryaboy avatar Feb 15 '12 06:02 dvryaboy

@sjlee check out this ancient pull request. The goal here is to make LZO indexes significantly smaller, which should make split calculation and similar operations much faster. It's meant to be backwards-compatible: new hadoop-lzo can read both new and old indexes, while old hadoop-lzo can't read new indexes, of course. It also introduces versioning, in case we want to change the format further.
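
As a rough illustration of the backwards-compatible read path, a reader can peek at the first bytes and dispatch on them. Everything in this sketch is hypothetical rather than taken from the pull request: the class name, the `MAGIC` marker, the version byte, and the 4-byte delta body. It also assumes the legacy layout is a flat sequence of 8-byte big-endian offsets, so a negative first value can never appear in an old index.

```java
import java.nio.ByteBuffer;

// Illustrative only: hypothetical format, not the one implemented in this PR.
final class VersionedIndexSketch {
    static final long MAGIC = -1L;  // hypothetical marker; legacy offsets are never negative
    static final byte VERSION = 1;  // hypothetical version number

    // Assumed legacy layout: one 8-byte big-endian long per block offset.
    static byte[] writeLegacy(long[] offsets) {
        ByteBuffer out = ByteBuffer.allocate(offsets.length * Long.BYTES);
        for (long offset : offsets) {
            out.putLong(offset);
        }
        return out.array();
    }

    // Hypothetical new layout: magic, version, count, then 4-byte deltas.
    static byte[] writeVersioned(long[] offsets) {
        ByteBuffer out = ByteBuffer.allocate(
                Long.BYTES + 1 + Integer.BYTES * (1 + offsets.length));
        out.putLong(MAGIC).put(VERSION).putInt(offsets.length);
        long previous = 0L;
        for (long offset : offsets) {
            out.putInt(Math.toIntExact(offset - previous));  // throws if a delta overflows int
            previous = offset;
        }
        return out.array();
    }

    // Reads either layout: the versioned one if the marker is present, else legacy.
    static long[] read(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        boolean versioned = data.length > Long.BYTES && in.getLong(0) == MAGIC;
        if (!versioned) {
            long[] offsets = new long[data.length / Long.BYTES];
            for (int i = 0; i < offsets.length; i++) {
                offsets[i] = in.getLong();
            }
            return offsets;
        }
        in.getLong();  // skip the marker
        byte version = in.get();
        if (version != VERSION) {
            throw new IllegalArgumentException("unsupported index version " + version);
        }
        long[] offsets = new long[in.getInt()];
        long previous = 0L;
        for (int i = 0; i < offsets.length; i++) {
            previous += in.getInt();
            offsets[i] = previous;
        }
        return offsets;
    }
}
```

An old reader handed the versioned bytes would misinterpret them as offsets, which matches the one-way compatibility described above.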

If this is interesting, I can take a pass at making this mergeable with current master.

dvryaboy avatar Aug 30 '14 18:08 dvryaboy

It does sound interesting. Could you give it a shot and let me know? Thanks.

sjlee avatar Sep 02 '14 20:09 sjlee

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Jul 18 '19 15:07 CLAassistant