
Unnecessary getFileStatus() calls on all part-files in ParquetInputFormat.getSplits

Open asfimport opened this issue 11 years ago • 2 comments

When testing Spark SQL Parquet support, we found that accessing large Parquet files located in S3 can be very slow. To be more specific, we have an S3 Parquet file with over 3,000 part-files, and calling ParquetInputFormat.getSplits on it takes several minutes. (We were accessing this file from our office network rather than from within AWS.)

After some investigation, we found that ParquetInputFormat.getSplits calls getFileStatus() on all part-files sequentially, one by one (here). In the case of S3, each getFileStatus() call issues an HTTP request and waits for the reply in a blocking manner, which is quite expensive.

Actually, all these FileStatus objects have already been fetched by the time the footers are retrieved (here). Caching these FileStatus objects greatly improves our S3 case (the time drops from over 5 minutes to about 1.4 minutes).
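
To make the cost difference concrete, here is a minimal sketch of the caching idea. It does not use the real Hadoop or Parquet classes: `Status`, `CountingStore`, and both `totalLength...` methods are hypothetical stand-ins, and the store only counts calls instead of issuing HTTP requests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not the Hadoop or Parquet API.
public class StatusCacheSketch {

    /** Minimal stand-in for a file status record. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Simulated remote store that counts how often each operation is called. */
    static final class CountingStore {
        private final Map<String, Long> files = new LinkedHashMap<>();
        int listCalls = 0;
        int statusCalls = 0;

        void put(String path, long length) {
            files.put(path, length);
        }

        /** One listing call returns the status of every part-file. */
        List<Status> listStatus() {
            listCalls++;
            List<Status> out = new ArrayList<>();
            for (Map.Entry<String, Long> e : files.entrySet()) {
                out.add(new Status(e.getKey(), e.getValue()));
            }
            return out;
        }

        /** One call per file; in the S3 case this would be a blocking request. */
        Status getFileStatus(String path) {
            statusCalls++;
            return new Status(path, files.get(path));
        }
    }

    /** Lists the files, then asks for each status again, one by one. */
    static long totalLengthUncached(CountingStore store) {
        long total = 0;
        for (Status listed : store.listStatus()) {
            total += store.getFileStatus(listed.path).length;
        }
        return total;
    }

    /** Lists the files once and answers later lookups from a local cache. */
    static long totalLengthCached(CountingStore store) {
        Map<String, Status> cache = new HashMap<>();
        for (Status listed : store.listStatus()) {
            cache.put(listed.path, listed);
        }
        long total = 0;
        for (String path : cache.keySet()) {
            total += cache.get(path).length; // served locally, no remote call
        }
        return total;
    }
}
```

With N part-files, the uncached variant makes N per-file calls on top of the listing, while the cached variant makes none.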

Will submit a PR for this issue soon.

Reporter: Cheng Lian / @liancheng

Note: This issue was originally created as PARQUET-16. Please see the migration documentation for further details.

asfimport, Jul 10 '14 20:07

Dmitriy V. Ryaboy / @dvryaboy: Could you check how this interacts with https://github.com/apache/incubator-parquet-mr/pull/2 ?

asfimport, Jul 10 '14 21:07

Cheng Lian / @liancheng: Hi @dvryaboy, thanks for pointing out that PR; the background information is very helpful. I can make my change compatible with that PR. Essentially, the problem we are facing is much simpler.

Currently, calling ParquetInputFormat.getSplits(JobContext jobContext) results in a call chain like this:

List<InputSplit> getSplits(JobContext jobContext)
 List<Footer> getFooters(JobContext jobContext)
  List<FileStatus> listStatus(JobContext jobContext)                                    <-- (1)
 List<ParquetInputSplit> getSplits(Configuration configuration, List<Footer> footers)
  ...                                                                                   <-- (2)

Basically, all the FileStatus objects are already fetched at (1) but discarded immediately, and then fetched again at (2). The problem is that (2) fetches them by calling getFileStatus() on every part-file sequentially. So we only need to pass the objects fetched at (1) on to (2); caching is not required to solve this performance issue.
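
A rough sketch of that hand-off, with the caveat that this is not the actual parquet-mr code: `Status`, `Footer`, `Split`, and the two method signatures below are hypothetical simplifications of the chain above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of the call chain; not the parquet-mr API.
public class SplitChainSketch {

    /** Stand-in for a file status obtained from the listing at (1). */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Stand-in for a footer that keeps the status it was read with. */
    static final class Footer {
        final Status status;

        Footer(Status status) {
            this.status = status;
        }
    }

    /** Stand-in for an input split covering one whole part-file. */
    static final class Split {
        final String path;
        final long length;

        Split(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Step (1): the statuses are already in hand, so keep them with the footers. */
    static List<Footer> getFooters(List<Status> listed) {
        List<Footer> footers = new ArrayList<>();
        for (Status status : listed) {
            footers.add(new Footer(status));
        }
        return footers;
    }

    /** Step (2): build splits from the carried statuses, with no further lookups. */
    static List<Split> getSplits(List<Footer> footers) {
        List<Split> splits = new ArrayList<>();
        for (Footer footer : footers) {
            splits.add(new Split(footer.status.path, footer.status.length));
        }
        return splits;
    }
}
```

In this sketch, step (2) receives no file system handle at all, so it has no way to issue per-file requests; everything it needs arrives with the footers.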

asfimport, Jul 11 '14 02:07