parquet-java PARQUET-3031: Support to transfer input stream when building ParquetFileReader

Rationale for this change

Allow an existing Parquet file input stream to be passed in when building the ParquetFileReader, so that the stream can be reused and the number of open-file RPCs is reduced.

What changes are included in this PR?

As title.

Are these changes tested?

Covered by existing unit tests. This change only adds new constructors.

Are there any user-facing changes?

No breaking changes.

Closes #3031

Oct 10 '24 04:10 turboFei

On many file systems, a seek backwards to read the data after reading the footer results in slower reads because the fs switches from a sequential read to a random read (which typically turns off pre-fetching and other optimizations enabled in sequential reads). It might be worth considering whether reusing the stream pays off.

Oct 10 '24 17:10 parthchandra

Thanks parthchandra for the comments.

In our company's internally managed Spark, we reuse the input stream for Parquet files.

Before that change:

A Spark task opens the file multiple times to read the footer and the data.

When the HDFS NameNode is under high pressure, these opens take time.

After that change, a task opens the Parquet file only once.
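A minimal sketch of the idea, not the actual parquet-java or Spark code: it uses a toy file layout (data bytes followed by a 4-byte length acting as the footer) and a hypothetical `openCount` counter standing in for NameNode open-file RPCs. All class and method names here are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamReuseSketch {
    // Hypothetical counter standing in for open-file RPCs to the NameNode.
    static int openCount = 0;

    static RandomAccessFile open(Path p) throws IOException {
        openCount++;
        return new RandomAccessFile(p.toFile(), "r");
    }

    // Toy layout: [data bytes][4-byte big-endian data length]; the trailing int plays the footer.
    static Path writeToyFile(String data) throws IOException {
        Path p = Files.createTempFile("toy", ".bin");
        byte[] body = data.getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile out = new RandomAccessFile(p.toFile(), "rw")) {
            out.write(body);
            out.writeInt(body.length);
        }
        return p;
    }

    // Read the "footer" from the end of the file.
    static int readFooter(RandomAccessFile in) throws IOException {
        in.seek(in.length() - 4);
        return in.readInt();
    }

    // Read the data region; on a reused stream this is a backward seek after the footer read.
    static String readData(RandomAccessFile in, int len) throws IOException {
        byte[] buf = new byte[len];
        in.seek(0);
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    // Before: one open for the footer, another open for the data.
    static String readWithTwoOpens(Path p) throws IOException {
        int len;
        try (RandomAccessFile footerStream = open(p)) {
            len = readFooter(footerStream);
        }
        try (RandomAccessFile dataStream = open(p)) {
            return readData(dataStream, len);
        }
    }

    // After: a single open, with the same stream reused for footer and data.
    static String readWithOneOpen(Path p) throws IOException {
        try (RandomAccessFile in = open(p)) {
            int len = readFooter(in);
            return readData(in, len);
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = writeToyFile("row-group-bytes");
        openCount = 0;
        String a = readWithTwoOpens(p);
        int twoOpens = openCount;
        openCount = 0;
        String b = readWithOneOpen(p);
        int oneOpen = openCount;
        // prints "true opens: 2 vs 1"
        System.out.println(a.equals(b) + " opens: " + twoOpens + " vs " + oneOpen);
        Files.delete(p);
    }
}
```

Both paths return the same bytes; the reused-stream path trades one fewer open for a backward seek on the same stream, which is the cost raised in the earlier comment.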

Oct 10 '24 17:10 turboFei

This is from testing done 3 years ago on Spark 2.3.

It reduces 3/2 of the HDFS RPC requests to the NameNode.

And after this Spark community patch, [[SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader](https://github.com/apache/spark/pull/39950), the solution might reduce 1/2 of the HDFS RPC requests.

Oct 10 '24 17:10 turboFei

It looks reasonable to me and users can choose their best fit.

cc @gszadovszky @steveloughran

Oct 11 '24 16:10 wgtmac

@wgtmac @gszadovszky Could we merge this PR?

Oct 31 '24 08:10 wangyum

Thank you all.

Nov 01 '24 03:11 wangyum

For the record, I merged @Fokko's Parquet 1.15.0 PR into the Apache Spark repository.

@turboFei and @wangyum, if you want, you can make a PR to use this new technique in Apache Spark as illustrated in the above comments.

Thank you!

Dec 03 '24 16:12 dongjoon-hyun