GH-3141: Add constructor to ParquetFileReader to allow passing in parquet footer
Rationale for this change
HadoopFile is being deprecated, so it is useful to be able to pass in an already-parsed footer instead of reading the footer every time.
What changes are included in this PR?
A new ParquetFileReader constructor that accepts an already-parsed footer.
Are these changes tested?
Yes
Are there any user-facing changes?
No
There is some code duplication, because a constructor can't call another constructor and also catch the IOException to close the stream. Let me know if you would like this handled another way. Thanks.
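To illustrate the duplication for reviewers, here is a minimal, self-contained sketch. It is not the code in this PR: `TrackedStream`, `Reader`, `readFooter`, and `validate` are hypothetical stand-ins, and the footer is a plain `String`. It only shows the language constraint: a `this(...)` call cannot be wrapped in a `try` block, so each constructor that takes ownership of a stream needs its own try/catch to close it on failure.

```java
import java.io.Closeable;
import java.io.IOException;

public class FooterReaderSketch {

    /** Hypothetical stand-in for a seekable stream; records whether close() was called. */
    static final class TrackedStream implements Closeable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Hypothetical stand-in for a file reader; not the real ParquetFileReader. */
    static final class Reader {
        final TrackedStream stream;
        final String footer;

        /** Reads the footer from the stream and closes the stream if that fails. */
        Reader(TrackedStream stream) throws IOException {
            this.stream = stream;
            String parsed;
            try {
                parsed = readFooter(stream);
            } catch (IOException | RuntimeException e) {
                stream.close();
                throw e;
            }
            this.footer = parsed;
        }

        /**
         * Accepts an already-parsed footer. Java does not allow
         * `try { this(...); } catch (...) { ... }`, so the close-on-failure
         * handling has to be repeated here rather than delegated.
         */
        Reader(TrackedStream stream, String footer) throws IOException {
            this.stream = stream;
            String checked;
            try {
                checked = validate(footer);
            } catch (IOException | RuntimeException e) {
                stream.close();
                throw e;
            }
            this.footer = checked;
        }

        /** Placeholder for real footer parsing (assumption for this sketch). */
        private static String readFooter(TrackedStream stream) throws IOException {
            return "footer-read-from-stream";
        }

        /** Placeholder check that rejects a missing footer (assumption for this sketch). */
        private static String validate(String footer) throws IOException {
            if (footer == null) {
                throw new IOException("footer must not be null");
            }
            return footer;
        }
    }

    public static void main(String[] args) throws IOException {
        // Success path: the stream stays open for the reader to use.
        TrackedStream ok = new TrackedStream();
        Reader reader = new Reader(ok, "cached-footer");
        System.out.println(reader.footer + " closed=" + ok.closed);

        // Failure path: the constructor closes the stream before rethrowing.
        TrackedStream bad = new TrackedStream();
        try {
            new Reader(bad, null);
        } catch (IOException e) {
            System.out.println("failed, closed=" + bad.closed);
        }
    }
}
```

One possible way to reduce the duplication would be a static factory method that wraps `new Reader(...)` in a try/catch, but that changes who owns the cleanup, so it is only a suggestion.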
Closes #3141
BTW, could you also add a test case for it?
@yuzhu do you have time to address the above comments? I ported this change into our internal branches. With some modifications on the Spark side, vectorized reading could cut NameNode RPCs by 3/4.
@pan3793 glad it helped. I will wrap this up this week.
https://github.com/apache/spark/pull/50765 demonstrates how this PR benefits Spark.
Kindly ping @yuzhu
@yuzhu thanks for updating. I think you need a rebase to resolve conflicts
Code change LGTM. It would be better to have test coverage, as requested by @wgtmac.
Is there an example unit test to model this after? I didn't find any unit test for ParquetFileReader.java.
cc @yuzhu
Error: Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.30.0:check (default) on project parquet-hadoop: The following files had format violations:
Error: src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Error: @@ -739,7 +739,8 @@
Error: ···*·@return·an·open·ParquetFileReader
Error: ···*·@throws·IOException·if·there·is·an·error·while·opening·the·file
Error: ···*/
Error: -··public·static·ParquetFileReader·open(InputFile·file,·ParquetMetadata·footer,·ParquetReadOptions·options,·SeekableInputStream·f)
Error: +··public·static·ParquetFileReader·open(
Error: +······InputFile·file,·ParquetMetadata·footer,·ParquetReadOptions·options,·SeekableInputStream·f)
Error: ······throws·IOException·{
Error: ····return·new·ParquetFileReader(file,·footer,·options,·f);
Error: ··}
Error: Run 'mvn spotless:apply' to fix these violations.
Error: -> [Help 1]
Error:
Error: To see the full stack trace of the errors, re-run Maven with the -e switch.
Error: Re-run Maven using the -X switch to enable full debug logging.
Error:
Error: For more information about the errors and possible solutions, please read the following articles:
Error: [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Error:
Error: After correcting the problems, you can resume the build with the command
Error: mvn <args> -rf :parquet-hadoop
Error: Process completed with exit code 1.
Please fix the style issue. @yuzhu
@yuzhu @pan3793 Any update?
I'd like to pick it up if the original author does not have time to continue this PR.