
Options to skip small files and not recurse on input paths

Open gsteelman opened this issue 10 years ago • 8 comments

Added support for a boolean configuration key "skip_indexing_small_files". If this is enabled, files smaller than one block in size will not be indexed. This is useful because indexing files smaller than a block is essentially wasteful. The default is false so the current behavior is preserved.
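For illustration, the check described above could look roughly like the sketch below. This is not the code from the PR: the class name `SmallFileCheck`, the method `shouldIndex`, and the use of a plain `Map` in place of Hadoop's configuration object are all assumptions made to keep the example self-contained. Only the key name comes from the description.

```java
import java.util.Map;

// Hypothetical helper, not the PR's actual code.
final class SmallFileCheck {
    // Key name taken from the PR description; how the PR reads it is an assumption.
    static final String SKIP_SMALL_KEY = "skip_indexing_small_files";

    // Returns true when the file should be indexed.
    // 'conf' is a plain Map standing in for the Hadoop configuration (assumption).
    static boolean shouldIndex(long fileLength, long blockSize, Map<String, String> conf) {
        // Default is false, so the existing behavior (index everything) is preserved.
        boolean skipSmall = Boolean.parseBoolean(conf.getOrDefault(SKIP_SMALL_KEY, "false"));
        // Files smaller than one block are skipped only when the option is enabled.
        return !(skipSmall && fileLength < blockSize);
    }
}
```

With the key unset, every file is still indexed, which matches the stated default.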

Added support for a boolean configuration key "recursive_indexing". If this is disabled, paths passed in on the command line will not be recursively searched for files to index. This allows flexibility in specifying input paths for indexing. The default is true so the current behavior is preserved.
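A rough sketch of what a recursion toggle could look like, using the local filesystem via `java.nio.file` as a stand-in for HDFS. The class `LzoPathLister` and its method are hypothetical. Treating "non-recursive" as "the path itself plus its direct children" is also an assumption; the PR may define it differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not the PR's actual code.
final class LzoPathLister {
    // Lists .lzo files under 'root'. When 'recursive' is false, only 'root' itself
    // and its direct children are considered (assumed meaning of non-recursive).
    static List<Path> listLzoFiles(Path root, boolean recursive) throws IOException {
        int maxDepth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(root, maxDepth)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".lzo"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```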

gsteelman avatar Aug 11 '14 21:08 gsteelman

This is odd -- I swear we did this years ago. @rangadi do you remember what the deal is? Is this something we put into EB instead of hadoop-lzo?

dvryaboy avatar Aug 12 '14 22:08 dvryaboy

It looks like the build is failing when using -P hadoop-old due to:

[ERROR] Failed to execute goal on project hadoop-lzo: Could not resolve dependencies for project com.hadoop.gplcompression:hadoop-lzo:jar:0.4.20-SNAPSHOT: Could not transfer artifact org.apache.hadoop:hadoop-core:jar:1.0.4 from/to central (http://repo.maven.apache.org/maven2): GET request of: org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.jar from central failed: Connection reset -> [Help 1]

gsteelman avatar Aug 12 '14 23:08 gsteelman

I've added a configuration option for what size of a file should be considered "small." By default it is Long.MIN_VALUE, which should preserve current behavior if it is not specified.

As it stands, the user can set lzo_skip_indexing_small_files = true without setting lzo_small_file_size, which leaves the size at its default of Long.MIN_VALUE. In that case, asking to skip small files would not actually skip any files, because no file length is smaller than Long.MIN_VALUE.

I see two possible remedies. Any preference on which one? I am leaning towards option 1.

  1. Ensure that skip and skip size are specified together (not just one or the other)
  2. Change default skip size to something like 1 block size.
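To make the pitfall and the two remedies concrete, here is a minimal sketch. The class `SkipSizeOptions` and its methods are hypothetical, and the strict "less than" comparison is an assumption about how the size threshold is applied.

```java
// Hypothetical helper, not the PR's actual code.
final class SkipSizeOptions {
    // Default named in the comment above; used here as the "not configured" marker.
    static final long UNSET = Long.MIN_VALUE;

    // Behavior as described: with the default threshold, no length is below
    // Long.MIN_VALUE, so nothing is ever skipped even when 'skip' is true.
    static boolean isSkipped(long fileLength, boolean skip, long smallFileSize) {
        return skip && fileLength < smallFileSize;
    }

    // Option 1: require the flag and the size to be specified together.
    static void validate(boolean skip, long smallFileSize) {
        boolean sizeSet = smallFileSize != UNSET;
        if (skip != sizeSet) {
            throw new IllegalArgumentException(
                "skip flag and small file size must be specified together");
        }
    }

    // Option 2: fall back to one block size when no size was configured.
    static long effectiveSize(long smallFileSize, long blockSize) {
        return smallFileSize == UNSET ? blockSize : smallFileSize;
    }
}
```

Option 1 fails fast on a half-specified configuration, while option 2 silently picks a default, so option 1 is the stricter of the two.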

gsteelman avatar Aug 14 '14 21:08 gsteelman

@gerashegalov @sjlee Thoughts?

gsteelman avatar Aug 29 '14 00:08 gsteelman

@dvryaboy It looks like a previous pull request https://github.com/twitter/hadoop-lzo/pull/82 did something similar, but was also never merged. It's possible the change you're talking about is in elephantbird instead of hadoop-lzo, like you said.

gsteelman avatar Aug 29 '14 01:08 gsteelman

Sorry it took me a super long time to revisit this. I went over the PR, and have some comments (some more major than others). Comments coming...

A high-level comment: it would be great if you could add some unit tests that cover this.

sjlee avatar Oct 26 '15 20:10 sjlee

Could you please add unit tests around the recursive behavior? There are quite a few tests around whether the file should be indexed, but I don't see tests for the recursion.

Also, it would be great if you can test this code against real data to see if there is any surprise that isn't caught by the unit tests (and review). Thanks again!

sjlee avatar Dec 17 '15 22:12 sjlee

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

CLAassistant avatar Jul 18 '19 15:07 CLAassistant