elephant-bird LzoInputFormat's listStatus() can take prohibitively long on S3 because it invokes FileInputFormat's listStatus() implementation

Open nellore opened this issue 9 years ago • 11 comments

LzoInputFormat's listStatus begins with

List<FileStatus> files = super.listStatus(job);

where super refers to FileInputFormat. FileInputFormat's listStatus() calls singleThreadedListStatus() (when LIST_STATUS_NUM_THREADS == 1), which is not optimized for S3. This becomes a problem at job start when an input directory contains many files; I've observed a single listStatus() call take 17 minutes with 50k files in an input path.

Proposed solution: use the listStatus method of the FileSystem of the appropriate input path (obtained from getInputPaths(job)). This will call the listStatus method of whatever class is specified by fs.s3[n].impl in core-site.xml.
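The delegation idea in the proposal can be sketched without Hadoop on the classpath. This is a toy model only: `Lister`, `register`, and `listInputs` are hypothetical names invented for illustration, standing in for "whatever class the configuration maps to the path's scheme" and its own listing method. It is not the code from the pull request.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SchemeDispatchListing {
    /** Hypothetical stand-in for a filesystem's own, possibly optimized, listing method. */
    interface Lister {
        List<String> list(URI dir);
    }

    // Maps a URI scheme such as "s3" to the implementation registered for it.
    private final Map<String, Lister> listersByScheme = new HashMap<>();

    void register(String scheme, Lister lister) {
        listersByScheme.put(scheme, lister);
    }

    /** Lists every input directory through the implementation registered for its scheme. */
    List<String> listInputs(List<URI> inputDirs) {
        List<String> files = new ArrayList<>();
        for (URI dir : inputDirs) {
            Lister lister = listersByScheme.get(dir.getScheme());
            if (lister == null) {
                throw new IllegalArgumentException("No lister registered for scheme: " + dir.getScheme());
            }
            files.addAll(lister.list(dir));
        }
        return files;
    }

    public static void main(String[] args) {
        SchemeDispatchListing listing = new SchemeDispatchListing();
        // A fake "s3" lister that returns two made-up file names under the directory.
        listing.register("s3", dir -> List.of(dir + "/a.lzo", dir + "/b.lzo"));
        System.out.println(listing.listInputs(List.of(URI.create("s3://bucket/in"))));
    }
}
```

The point of the design is that the caller never walks the directory itself; any request batching is left to the scheme-specific implementation.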

Dec 13 '14 19:12 nellore

@buci Doesn't super.listStatus(job) ultimately translate to whateverFs#listStatus?

You can control LIST_STATUS_NUM_THREADS by setting the configuration property mapreduce.input.fileinputformat.list-status.num-threads.

The real issue is whether S3 returns LocatedFileStatus as HDFS does. If not, FIF (FileInputFormat) will have to do another round of RPCs, one getFileBlockLocations call per file.

Dec 16 '14 01:12 gerashegalov

@gerashegalov True -- FileInputFormat does have to do a round of RPCs to getFileBlockLocations if a file is not an instance of LocatedFileStatus. I experimented with increasing mapreduce.input.fileinputformat.list-status.num-threads to 20 before making the pull request. It appeared to have zero effect on speed when I tried FileInputFormat's listStatus, and I think the issue is the globStatus calls. They call glob on Globber objects, which -- correct me if I'm wrong -- leads to lots of fs method invocations (like isDirectory) that each separately make S3 requests. The pull request resolves the issue on EMR by calling only methods overridden in emrfs (decompile emrfs-1.0.0.jar on one of the latest AMIs for details) -- that is, listStatus and getFileBlockLocations, which appear to be designed to minimize S3 requests. For other FileSystems, speed should not be affected.

Let me know if I'm wrong about whether globStatus is the problem. If not, there must be some other explanation why I notice a significant performance improvement with my implementation.
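To make the request-count argument concrete, here is a toy model. It is not Hadoop's Globber; `CountingStore`, `probeEachEntry`, and `listOnly` are hypothetical names. It assumes the claim as stated above: each per-entry probe such as isDirectory costs one remote request. It also treats a directory listing as a single request, which is a simplification, since a real S3 listing is paginated.

```java
import java.util.ArrayList;
import java.util.List;

class ListingRequestCount {
    /** Toy remote store that counts every simulated round trip. */
    static class CountingStore {
        private final List<String> names = new ArrayList<>();
        int requests = 0;

        CountingStore(int fileCount) {
            for (int i = 0; i < fileCount; i++) {
                names.add("part-" + i + ".lzo");
            }
        }

        /** One simulated request returns every entry name in the directory. */
        List<String> listDirectory() {
            requests++;
            return new ArrayList<>(names);
        }

        /** One simulated request per probed entry; this store holds only files. */
        boolean isDirectory(String name) {
            requests++;
            return false;
        }
    }

    /** Lists once, then probes each entry separately. */
    static int probeEachEntry(CountingStore store) {
        int files = 0;
        for (String name : store.listDirectory()) {
            if (!store.isDirectory(name)) {
                files++;
            }
        }
        return files;
    }

    /** Lists once and trusts the listing, assuming it already tells files from directories. */
    static int listOnly(CountingStore store) {
        return store.listDirectory().size();
    }

    public static void main(String[] args) {
        CountingStore probed = new CountingStore(50_000);
        probeEachEntry(probed);
        CountingStore listed = new CountingStore(50_000);
        listOnly(listed);
        // Prints "50001 vs 1".
        System.out.println(probed.requests + " vs " + listed.requests);
    }
}
```

In this model both strategies return the same 50,000 files; only the number of round trips differs. That would also be consistent with extra listing threads not helping much if the per-entry probes dominate.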

Dec 16 '14 14:12 nellore

Oh, I guess other FileSystems are affected if the user has set mapreduce.input.fileinputformat.list-status.num-threads....

Dec 16 '14 15:12 nellore

:+1:

Feb 28 '15 01:02 pkallos

Did you notice this issue too, pkallos?

Feb 28 '15 01:02 nellore

absolutely yes!

Mar 02 '15 18:03 pkallos

How are you solving it? The issue seems dead, but it's real, and I'm happy to code something different if someone proposes a better strategy.

Mar 03 '15 01:03 nellore

not getting around it at the moment, just absorbing the time-cost which is painful. does your solution in #428 work as advertised?

Mar 03 '15 04:03 pkallos

I've tested a few times on a few inputs, and it's listed all files quickly. Add a comment if you find an issue.

Mar 03 '15 05:03 nellore

Note that solution #428 is not backwards compatible as it does not support globs and path filters.
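One way glob and filter support could in principle be kept alongside a single listing is to list the parent once and apply both checks client-side. The sketch below is not what #428 does. `select` is a hypothetical helper, it uses the JDK's java.nio glob syntax (which may differ from Hadoop's glob rules), and it assumes each listed name is a single path component.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class ClientSideGlobFilter {
    /**
     * Keeps the names that match the glob and pass the extra filter.
     * The names are assumed to come from one listing of the parent directory.
     */
    static List<String> select(List<String> listedNames, String glob, Predicate<String> extraFilter) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> kept = new ArrayList<>();
        for (String name : listedNames) {
            if (matcher.matches(Paths.get(name)) && extraFilter.test(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> listed = Arrays.asList("dir1", "dir2", "other", "dir_tmp");
        // The glob keeps names starting with "dir"; the filter then drops temporary entries.
        System.out.println(select(listed, "dir*", name -> !name.endsWith("_tmp")));
    }
}
```

With the sample names above, "dir1" and "dir2" are kept, "other" fails the glob, and "dir_tmp" is removed by the filter.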

Mar 04 '15 03:03 gerashegalov

@buci How is your input specified? What if the input is '/user/name/dir*'? I'm not sure of the contract for globStatus. One workaround would be to specify the parent directory (making sure the directory has only one file).

Mar 04 '15 08:03 rangadi
