hadoop-lzo
Options to skip small files and not recurse on input paths
Added support for a boolean configuration key "skip_indexing_small_files". If this is enabled, files smaller than one block in size will not be indexed. This is useful because indexing files smaller than a block is essentially wasteful. The default is false so the current behavior is preserved.
Added support for a boolean configuration key "recursive_indexing". If this is disabled, paths passed in on the command line will not be recursively searched for files to index. This allows for flexibility in specifying input paths for indexing. The default is true so the current behavior is preserved.
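As a rough illustration of how these two keys might gate the indexer, here is a minimal sketch. It is not the code from this PR: a plain `Map<String, String>` stands in for the Hadoop configuration object, the class and method names are hypothetical, and it assumes that `recursive_indexing = true` means "descend into directories", which is what the stated default implies.

```java
import java.util.Map;

// Hypothetical helper, not part of hadoop-lzo. A Map stands in for the
// Hadoop configuration so the sketch runs without any dependencies.
final class IndexingOptionsSketch {
    static final String SKIP_SMALL_KEY = "skip_indexing_small_files";
    static final String RECURSIVE_KEY = "recursive_indexing";

    // Reads a boolean key, falling back to the given default when unset.
    static boolean getBoolean(Map<String, String> conf, String key, boolean def) {
        String value = conf.get(key);
        return value == null ? def : Boolean.parseBoolean(value.trim());
    }

    // A file is skipped only when the option is on AND the file is smaller
    // than one block. With the default (false), every file is indexed.
    static boolean shouldIndexFile(Map<String, String> conf, long fileLen, long blockSize) {
        boolean skipSmall = getBoolean(conf, SKIP_SMALL_KEY, false);
        return !(skipSmall && fileLen < blockSize);
    }

    // Directories are descended into unless the option is explicitly false.
    static boolean shouldDescend(Map<String, String> conf) {
        return getBoolean(conf, RECURSIVE_KEY, true);
    }
}
```

With an empty configuration both methods return true, which matches the "current behavior is preserved" defaults described above.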
This is odd -- I swear we did this years ago. @rangadi do you remember what the deal is? Is this something we put into EB instead of hadoop-lzo?
It looks like the build is failing when using -P hadoop-old due to: [ERROR] Failed to execute goal on project hadoop-lzo: Could not resolve dependencies for project com.hadoop.gplcompression:hadoop-lzo:jar:0.4.20-SNAPSHOT: Could not transfer artifact org.apache.hadoop:hadoop-core:jar:1.0.4 from/to central (http://repo.maven.apache.org/maven2): GET request of: org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.jar from central failed: Connection reset -> [Help 1]
I've added a configuration option for what size of a file should be considered "small." By default it is Long.MIN_VALUE, which should preserve current behavior if it is not specified.
As it stands currently, the user can set lzo_skip_indexing_small_files = true without setting lzo_small_file_size, which leaves the size at its default of Long.MIN_VALUE. In that case, asking to skip small files would not actually skip any files.
I see two possible remedies. Any preference on which one? I am leaning towards option 1.
1. Ensure that skip and skip size are specified together (not just one or the other)
2. Change default skip size to something like 1 block size.
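To make the trade-off concrete, here is a minimal sketch of both remedies: rejecting a configuration that sets only one of the two keys, and falling back to the block size when the size is unset. It is not the code from this PR. A plain `Map<String, String>` stands in for the Hadoop configuration, and the class and method names are hypothetical; only the two key names and the Long.MIN_VALUE default come from the discussion above.

```java
import java.util.Map;

// Hypothetical helper, not part of hadoop-lzo.
final class SmallFileThresholdSketch {
    static final String SKIP_KEY = "lzo_skip_indexing_small_files";
    static final String SIZE_KEY = "lzo_small_file_size";

    static boolean skipEnabled(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(SKIP_KEY, "false"));
    }

    // Remedy 1: the skip flag and the size must be set together.
    static long thresholdStrict(Map<String, String> conf) {
        boolean skip = skipEnabled(conf);
        String size = conf.get(SIZE_KEY);
        if (skip != (size != null)) {
            throw new IllegalArgumentException(
                SKIP_KEY + " and " + SIZE_KEY + " must be specified together");
        }
        // Long.MIN_VALUE means no file length can ever be below the threshold.
        return skip ? Long.parseLong(size) : Long.MIN_VALUE;
    }

    // Remedy 2: when skipping is on but no size is given, use one block.
    static long thresholdWithBlockDefault(Map<String, String> conf, long blockSize) {
        if (!skipEnabled(conf)) {
            return Long.MIN_VALUE;
        }
        String size = conf.get(SIZE_KEY);
        return size == null ? blockSize : Long.parseLong(size);
    }

    // A file counts as "small" when it is strictly below the threshold.
    static boolean isSmall(long fileLen, long threshold) {
        return fileLen < threshold;
    }
}
```

The strict variant fails fast on a half-specified configuration, while the block-size variant silently picks a default; which is preferable depends on how much existing callers rely on setting the flag alone.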
@gerashegalov @sjlee Thoughts?
@dvryaboy It looks like a previous pull request https://github.com/twitter/hadoop-lzo/pull/82 did something similar, but was also never merged. It's possible the change you're talking about is in elephantbird instead of hadoop-lzo, like you said.
Sorry it took me a super long time to revisit this. I went over the PR, and have some comments (some more major than others). Comments coming...
A high level comment: it would be great if you can add some unit tests that cover this.
Could you please add unit tests around the recursive behavior? There are quite a few tests around whether the file should be indexed, but I don't see tests for the recursion.
Also, it would be great if you can test this code against real data to see if there is any surprise that isn't caught by the unit tests (and review). Thanks again!
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.