Unnecessary getFileStatus() calls on all part-files in ParquetInputFormat.getSplits
When testing Spark SQL Parquet support, we found that accessing large Parquet files located in S3 can be very slow. To be more specific, we have an S3 Parquet file with over 3,000 part-files, and calling ParquetInputFormat.getSplits on it takes several minutes. (We were accessing this file from our office network rather than AWS.)
After some investigation, we found that ParquetInputFormat.getSplits calls getFileStatus() on all part-files sequentially, one by one (here). In the case of S3, each getFileStatus() call issues an HTTP request and waits for the reply in a blocking manner, which is quite expensive.
Actually all these FileStatus objects have already been fetched when footers are retrieved (here). Caching these FileStatus objects can greatly improve our S3 case (reduced from over 5 minutes to about 1.4 minutes).
Will submit a PR for this issue soon.
Reporter: Cheng Lian / @liancheng
Related issues:
- Use LRU caching for footers in ParquetInputFormat. (relates to)
- Cleanup FilteringParquetRowInputFormat (is related to)
- Reading Parquet InputSplits dominates query execution time when reading off S3 (is related to)
- Improve Parquet IO Performance within cloud datalakes (is depended upon by)
Note: This issue was originally created as PARQUET-16. Please see the migration documentation for further details.
Dmitriy V. Ryaboy / @dvryaboy: Could you check how this interacts with https://github.com/apache/incubator-parquet-mr/pull/2 ?
Cheng Lian / @liancheng: Hi @dvryaboy, thanks for pointing out that PR; the background information is very helpful. I can make my change compatible with that PR. Essentially, the problem we are facing is much simpler.
Currently, calling ParquetInputFormat.getSplits(JobContext jobContext) results in a call chain like this:
List<InputSplit> getSplits(JobContext jobContext)
  List<Footer> getFooters(JobContext jobContext)
    List<FileStatus> listStatus(JobContext jobContext)    <-- (1)
  List<ParquetInputSplit> getSplits(Configuration configuration, List<Footer> footers)
    ...    <-- (2)
Basically, all the FileStatus objects are already fetched at (1), but they are discarded immediately and then fetched again at (2). Worse, (2) fetches them by calling getFileStatus() on every part-file sequentially. So we only need to pass the objects fetched at (1) through to (2); caching is not required to solve this performance issue.
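To make the difference concrete, here is a minimal, self-contained sketch of the two approaches. It does not use the real Hadoop or Parquet classes: Status, CountingFs, and both totalLength* methods are hypothetical stand-ins invented for illustration, and the listing is modeled as a single remote request, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Hadoop's FileStatus/FileSystem; NOT the real API.
final class StatusReuseSketch {

    /** Minimal stand-in for a file status: path plus length. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Fake file system that counts simulated remote round trips. */
    static final class CountingFs {
        private final List<Status> files = new ArrayList<>();
        int remoteCalls = 0;

        CountingFs(int partFiles) {
            for (int i = 0; i < partFiles; i++) {
                files.add(new Status("part-" + i + ".parquet", 1024L * (i + 1)));
            }
        }

        /** Assumption: one listing request returns every part-file's status. */
        List<Status> listStatus() {
            remoteCalls++;
            return new ArrayList<>(files);
        }

        /** One request per file; this models the expensive call on S3. */
        Status getFileStatus(String path) {
            remoteCalls++;
            for (Status s : files) {
                if (s.path.equals(path)) {
                    return s;
                }
            }
            throw new IllegalArgumentException("no such file: " + path);
        }
    }

    /** Before: statuses from (1) are dropped and re-fetched one by one at (2). */
    static long totalLengthRefetching(CountingFs fs) {
        List<Status> listed = fs.listStatus();          // (1)
        long total = 0;
        for (Status s : listed) {
            total += fs.getFileStatus(s.path).length;   // (2): one call per file
        }
        return total;
    }

    /** After: statuses from (1) are handed straight to (2). */
    static long totalLengthReusing(CountingFs fs) {
        long total = 0;
        for (Status s : fs.listStatus()) {              // (1)
            total += s.length;                          // (2): no extra call
        }
        return total;
    }
}
```

In this model, N part-files cost N + 1 simulated requests in the first variant and a single request in the second, while both compute the same result.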