dfs-datastores

Use the CombineFileInputFormat to avoid too many mappers

Open hansmire opened this issue 10 years ago • 3 comments

Change the SequenceFilePailInputFormat to use the CombineFileInputFormat. This should reduce the number of input splits for Pail sources. In my tests, several thousand splits were reduced to one.
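
To make the effect on split counts concrete, here is a minimal sketch of the general idea of packing many small files into a few splits. It is not the Hadoop CombineFileInputFormat implementation (which also takes node and rack locality into account); the class name `SplitPackingSketch`, the `pack` method, and the 256 MiB cap are all hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Hadoop CombineFileInputFormat code.
class SplitPackingSketch {

    // Groups file sizes (in bytes) into splits whose total stays at or below
    // maxSplitBytes. A single file larger than the cap gets a split of its own.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed workload: 3000 small files of 64 KiB each.
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 3000; i++) {
            sizes.add(64L * 1024);
        }
        long maxSplitBytes = 256L * 1024 * 1024; // assumed cap of 256 MiB
        System.out.println("one split per file: " + sizes.size());
        System.out.println("combined splits: " + pack(sizes, maxSplitBytes).size());
    }
}
```

With one split per file, each small file would get its own mapper; once the files are packed under a size cap, the number of mappers depends on total bytes rather than file count.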

There is an issue with this change: it does not work with Hadoop 2.0.5-alpha, which is the version I have deployed. In that version, CombineFileInputFormat does not call listStatus(JobConf conf) from the mapred package to get the list of files. Instead, it calls listStatus(JobContext conf) from the mapreduce package.

I fixed this by pulling a copy of CombineFileInputFormat into the project to avoid version conflicts.
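
The version problem described above comes down to which overload the base class calls. The sketch below models it with stand-in classes; `JobConf`, `JobContext`, `CombiningFormat`, and `PailLikeFormat` here are simplified, hypothetical types defined inside the example, not the Hadoop classes, and the string return values are placeholders.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in types; these are NOT the Hadoop classes, only a model of the dispatch problem.
class OverloadDispatchSketch {

    static class JobConf {}

    // Models the newer-API context object that wraps a configuration.
    static class JobContext {
        final JobConf conf;
        JobContext(JobConf conf) { this.conf = conf; }
    }

    // Models a combining input format that exposes two file-listing hooks.
    static class CombiningFormat {
        List<String> listStatus(JobConf conf) {
            return Arrays.asList("default-old-api");
        }

        List<String> listStatus(JobContext context) {
            return Arrays.asList("default-new-api");
        }

        // Models a library version whose split computation calls the
        // JobContext overload instead of the JobConf one.
        List<String> getSplits(JobConf conf) {
            return listStatus(new JobContext(conf));
        }
    }

    // Models a subclass that customizes only the JobConf overload.
    static class PailLikeFormat extends CombiningFormat {
        @Override
        List<String> listStatus(JobConf conf) {
            return Arrays.asList("pail-files");
        }
    }

    public static void main(String[] args) {
        PailLikeFormat format = new PailLikeFormat();
        // The override is used when called directly...
        System.out.println(format.listStatus(new JobConf()));
        // ...but getSplits goes through the other overload and bypasses it.
        System.out.println(format.getSplits(new JobConf()));
    }
}
```

Because Java picks the overload at compile time from the argument type, an override of the JobConf variant is never reached when the base class passes a JobContext. Bundling a known copy of the base class is one way to pin which overload gets called.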

hansmire, Apr 03 '14 06:04

Do you take advantage of the consolidate functions on your pails ever? I personally never ran into an issue with too many small files because I always ensure that my master pails are consolidated before I run my hadoop jobs on them.

sorenmacbeth, Apr 03 '14 07:04

I tried to use it, but I did not have access to the hardcoded /tmp directory. I see there is another PR to fix that problem though. Can you explain a bit more how that works?

Does the data remain partitioned as it is in the master directory? Is the master directory replaced?

hansmire, Apr 03 '14 20:04

The data remains partitioned as designed. Files in each sub-pail within the master pail are merged in place. You can configure the size of each consolidated file as well.

Pail p = new Pail("/some/path");
p.consolidate();

sorenmacbeth, Apr 03 '14 20:04