
GH-3141: Add constructor to ParquetFileReader to allow passing in parquet footer

Open yuzhu opened this issue 9 months ago • 8 comments

Rationale for this change

HadoopFile is being deprecated, so it is useful to be able to pass in an already-parsed footer instead of reading the footer every time.

What changes are included in this PR?

a new constructor

Are these changes tested?

Yes

Are there any user-facing changes?

No

There is some code duplication, because we can't call another constructor and catch the IOException to close the stream. Let me know if you would like this handled another way. Thanks.
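
For anyone following along, here is a minimal, self-contained sketch of the constraint mentioned above and one possible way around it. It is not the code in this PR: `TrackedStream` and `Reader` are hypothetical stand-ins for the stream and reader types, and the footer is modeled as a plain `String`. The point is only that an explicit `this(...)` call in a constructor cannot be enclosed in a try block, whereas a static factory method can wrap construction and close the stream on failure.

```java
import java.io.Closeable;
import java.io.IOException;

public class FooterReaderSketch {

  /** Hypothetical stand-in for an input stream; records whether it was closed. */
  static final class TrackedStream implements Closeable {
    private boolean closed = false;

    @Override
    public void close() {
      closed = true;
    }

    boolean isClosed() {
      return closed;
    }
  }

  /** Hypothetical stand-in for a reader that accepts an already-parsed footer. */
  static final class Reader implements Closeable {
    private final TrackedStream stream;
    private final String footer;

    // Initialization may fail after the caller has already opened the stream.
    private Reader(String footer, TrackedStream stream) throws IOException {
      if (footer == null || footer.isEmpty()) {
        throw new IOException("footer is missing");
      }
      this.footer = footer;
      this.stream = stream;
    }

    /**
     * Static factory that owns the cleanup. A delegating constructor cannot
     * put its this(...) call inside try/catch, but a static method can wrap
     * the constructor call freely.
     */
    static Reader open(String footer, TrackedStream stream) throws IOException {
      try {
        return new Reader(footer, stream);
      } catch (IOException | RuntimeException e) {
        stream.close(); // avoid leaking the stream when construction fails
        throw e;
      }
    }

    String footer() {
      return footer;
    }

    @Override
    public void close() {
      stream.close();
    }
  }

  public static void main(String[] args) throws IOException {
    TrackedStream good = new TrackedStream();
    try (Reader reader = Reader.open("parsed-footer", good)) {
      System.out.println("opened with footer: " + reader.footer());
    }

    TrackedStream bad = new TrackedStream();
    try {
      Reader.open("", bad);
    } catch (IOException e) {
      System.out.println("stream closed after failure: " + bad.isClosed());
    }
  }
}
```

Whether a shared private helper or a factory like this is preferable to the duplicated constructor body is a judgment call for the maintainers; the sketch only shows that the cleanup logic can live in one place.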

Closes #3141

yuzhu avatar Feb 28 '25 22:02 yuzhu

BTW, could you also add a test case for it?

wgtmac avatar Mar 01 '25 15:03 wgtmac

@yuzhu do you have time to address the above comments? I ported this change into our internal branches; with some modifications on the Spark side, vectorized reading can cut namenode RPCs by 3/4.

pan3793 avatar Apr 30 '25 03:04 pan3793

@pan3793 glad it helped. I will wrap this up this week.

yuzhu avatar Apr 30 '25 03:04 yuzhu

https://github.com/apache/spark/pull/50765 demonstrates how this PR benefits Spark.

pan3793 avatar Apr 30 '25 09:04 pan3793

Kindly ping @yuzhu

pan3793 avatar May 12 '25 10:05 pan3793

@yuzhu thanks for updating. I think you need a rebase to resolve conflicts

pan3793 avatar Jun 23 '25 02:06 pan3793

Code change LGTM; it would be better to have test coverage, as requested by @wgtmac.

pan3793 avatar Jun 23 '25 03:06 pan3793

Is there an example unit test to model this after? I didn't find any unit tests for ParquetFileReader.java.

yuzhu avatar Jun 23 '25 04:06 yuzhu

cc @yuzhu [image]

wangyum avatar Jul 11 '25 06:07 wangyum

Error:  Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.30.0:check (default) on project parquet-hadoop: The following files had format violations:
Error:      src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Error:          @@ -739,7 +739,8 @@
Error:           ···*·@return·an·open·ParquetFileReader
Error:           ···*·@throws·IOException·if·there·is·an·error·while·opening·the·file
Error:           ···*/
Error:          -··public·static·ParquetFileReader·open(InputFile·file,·ParquetMetadata·footer,·ParquetReadOptions·options,·SeekableInputStream·f)
Error:          +··public·static·ParquetFileReader·open(
Error:          +······InputFile·file,·ParquetMetadata·footer,·ParquetReadOptions·options,·SeekableInputStream·f)
Error:           ······throws·IOException·{
Error:           ····return·new·ParquetFileReader(file,·footer,·options,·f);
Error:           ··}
Error:  Run 'mvn spotless:apply' to fix these violations.
Error:  -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Error:  
Error:  After correcting the problems, you can resume the build with the command
Error:    mvn <args> -rf :parquet-hadoop
Error: Process completed with exit code 1.

Please fix the style issue. @yuzhu

wgtmac avatar Jul 14 '25 03:07 wgtmac

@yuzhu @pan3793 Any update?

wangyum avatar Jul 31 '25 03:07 wangyum

I'd like to pick it up if the original author does not have time to continue this PR.

pan3793 avatar Aug 01 '25 04:08 pan3793