parquet-java
ParquetOutputFormat should support custom OutputCommitter
There is a need to bypass the current Hadoop behavior of writing output data under a _temporary folder. With AWS S3 in particular, moving the files from the _temporary folder to the output folder can add huge overhead.
Reporter: Mikko Kupsu
Assignee: Steve Loughran / @steveloughran
Related issues:
- Improve Parquet IO Performance within cloud datalakes (is depended upon by)
PRs and other links:
Note: This issue was originally created as PARQUET-781. Please see the migration documentation for further details.
Steve Loughran / @steveloughran: The strategy I propose for this is straightforward:
- Change the type of the committer field to OutputCommitter.
- Take the committer from the superclass when the option is set:
  if (jobConf.getBoolean("option to use path output committer", false)) {
    outputCommitter = super.getOutputCommitter()
  }
- In that case, if ParquetOutputFormat.getJobSummaryLevel(configuration) != NONE, log at warn and continue.
There shouldn't be any need to do reflection games.
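The selection logic above could be sketched roughly as follows. This is only an illustration, not the actual patch: the OutputCommitter interface and the JobSummaryLevel enum here are small stand-ins so the snippet compiles without Hadoop or Parquet on the classpath, the configuration key is hypothetical (the proposal does not name the real option), and nesting the summary-level warning inside the opt-in branch is one reading of the proposal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch of the proposed committer selection, using stand-in types only.
class CommitterSelectionSketch {

    /** Stand-in for Hadoop's OutputCommitter; NOT the real class. */
    interface OutputCommitter {
        String describe();
    }

    /** Stand-in for Parquet's job summary level; NOT the real enum. */
    enum JobSummaryLevel { NONE, COMMON_METADATA, ALL }

    // Hypothetical option name, chosen for this example only.
    static final String USE_PATH_COMMITTER = "example.use.path.output.committer";

    /**
     * Returns the committer the superclass would supply when the option is
     * enabled, otherwise the Parquet-specific default. With the option on and
     * summary files requested, it warns and continues rather than failing.
     */
    static OutputCommitter select(Map<String, String> conf,
                                  JobSummaryLevel summaryLevel,
                                  OutputCommitter fromSuper,
                                  OutputCommitter parquetDefault,
                                  Consumer<String> warn) {
        boolean usePathCommitter =
                Boolean.parseBoolean(conf.getOrDefault(USE_PATH_COMMITTER, "false"));
        if (!usePathCommitter) {
            return parquetDefault;
        }
        if (summaryLevel != JobSummaryLevel.NONE) {
            warn.accept("Job summary level is " + summaryLevel
                    + " but a custom committer is in use; continuing anyway");
        }
        return fromSuper;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(USE_PATH_COMMITTER, "true");

        OutputCommitter fromSuper = () -> "committer from super.getOutputCommitter()";
        OutputCommitter parquetDefault = () -> "Parquet's own committer";

        OutputCommitter chosen = select(conf, JobSummaryLevel.ALL, fromSuper,
                parquetDefault, msg -> System.err.println("WARN " + msg));
        System.out.println(chosen.describe());
    }
}
```

Because the field is typed as the base OutputCommitter, whichever implementation the superclass hands back can be stored directly, which is why no reflection should be needed.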