elephant-bird
elephant-bird copied to clipboard
LzoInputFormat's listStatus() can take prohibitively long on S3 because it invokes FileInputFormat's listStatus() implementation
LzoInputFormat's listStatus begins with
List<FileStatus> files = super.listStatus(job);
where super refers to FileInputFormat. FileInputFormat's listStatus() calls singleThreadedListStatus() (when LIST_STATUS_NUM_THREADS == 1), which is not optimized for S3. This becomes an issue when listing a directory with many files when a job begins; I've observed that a single listStatus() call can take 17 minutes when there are 50k files in an input path.
Proposed solution: use the listStatus method of the FileSystem of the appropriate input path (obtained from getInputPaths(job)). This will call the listStatus method of whatever class is specified by fs.s3[n].impl in core-site.xml.
@buci Does not super.listStatus(job) ultimately translate to whateverFs#listStatus ?
You can control LIST_STATUS_NUM_THREADS by specifying the conf mapreduce.input.fileinputformat.list-status.num-threads
The real issue is whether S3 returns LocatedFileStatus
as HDFS does. If not FIF will have to do another round of RPC's for each File to getFileBlockLocations
.
@gerashegalov True -- FileInputFormat
does have to do a round of RPCs to getFileBlockLocations
if file
is not an instance of LocatedFileStatus
. I experimented with increasing mapreduce.input.fileinputformat.list-status.num-threads
to 20 to before making the pull request. It appeared to have zero effect on speed when I tried FileInputFormat's listStatus
, and I think the issue is the globStatus
calls. They call glob
on Globber
objects, which -- correct me if I'm wrong -- leads to lots of fs method invocations (like isDirectory) that each separately make S3 requests. The pull request resolves the issue on EMR by calling only methods overrided in emrfs (decompile emrfs-1.0.0.jar on one of the latest AMIs for details) -- that is, listStatus
and getFileBlockLocations
, which appear to be designed to minimize S3 requests. For other FileSystems, speed should not be affected.
Let me know if I'm wrong about whether globStatus
is the problem. If not, there must be some other explanation why I notice a significant performance improvement with my implementation.
Oh, I guess other FileSystems are affected if the user has set mapreduce.input.fileinputformat.list-status.num-threads
....
:+1:
Did you notice this issue too, pkallos?
absolutely yes!
How are you solving it? The issue seems dead, but it's real, and I'm happy to code something different if someone proposes a better strategy.
not getting around it at the moment, just absorbing the time-cost which is painful. does your solution in #428 work as advertised?
I've tested a few times on a few inputs, and it's listed all files quickly. Add a comment if you find an issue.
Note that solution #428 is not backwards compatible as it does not support globs and path filters.
@buci how is your input specified? What if input is '/user/name/dir*' ? Not sure of contract for globStatus. one work around would be to specify parent directory (making sure the directory has only one file)