parquet-java

PARQUET-3031: Support to transfer input stream when building ParquetFileReader

Open turboFei opened this issue 1 year ago • 4 comments

Rationale for this change

Support passing an existing Parquet file input stream when building the ParquetFileReader, so that the stream can be reused and the number of open-file RPCs is reduced.

What changes are included in this PR?

As title.

Are these changes tested?

Covered by existing unit tests. The change only adds new constructors.

Are there any user-facing changes?

No breaking changes.

Closes #3031

turboFei avatar Oct 10 '24 04:10 turboFei

On many file systems, a seek backwards to read the data after reading the footer results in slower reads because the fs switches from a sequential read to a random read (which typically turns off pre-fetching and other optimizations enabled in sequential reads). It might be worth considering if reusing the stream is worth it.
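To make the access pattern concrete, here is a minimal sketch of why a single reused stream has to seek backwards. It relies only on the documented Parquet layout (leading `PAR1` magic, data, footer, 4-byte little-endian footer length, trailing `PAR1` magic). The class and method names are hypothetical and are not part of parquet-java, and a real reader may fetch the tail and footer in one read.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not parquet-java code.
final class FooterThenDataSketch {
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Builds a toy file: magic, data, footer, 4-byte LE footer length, magic.
    static byte[] buildToyFile(byte[] data, byte[] footer) {
        int size = MAGIC.length + data.length + footer.length + 4 + MAGIC.length;
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(MAGIC).put(data).put(footer).putInt(footer.length).put(MAGIC);
        return buf.array();
    }

    // Returns, in order, the offsets one shared stream would seek to.
    static List<Integer> seekOffsets(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> seeks = new ArrayList<>();
        int tailStart = file.length - 4 - MAGIC.length; // footer length + magic
        seeks.add(tailStart);
        int footerLength = buf.getInt(tailStart);
        seeks.add(tailStart - footerLength);            // start of the footer
        seeks.add(MAGIC.length);                        // backward seek to the first data byte
        return seeks;
    }

    public static void main(String[] args) {
        byte[] file = buildToyFile(new byte[10], new byte[5]);
        System.out.println(seekOffsets(file));
    }
}
```

The last offset is smaller than the ones before it, which is the backward seek that can push a file system from sequential to random read mode.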

parthchandra avatar Oct 10 '24 17:10 parthchandra

Thanks parthchandra for the comments.

In our company's internally managed Spark, we reuse the input stream for Parquet files.

Before the change:

A Spark task opens the file multiple times to read the footer and the data.

When the HDFS NameNode is under heavy load, these opens take time.

After the change, the task opens the Parquet file only once.
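A rough sketch of the difference, assuming each `open()` call maps to one NameNode RPC. `CountingFileSystem` and `StreamReuseSketch` are hypothetical stand-ins for illustration and are not the parquet-java or Hadoop APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a remote file system; each open() models one RPC.
final class CountingFileSystem {
    private final byte[] content;
    private final AtomicInteger opens = new AtomicInteger();

    CountingFileSystem(byte[] content) {
        this.content = content;
    }

    InputStream open() {
        opens.incrementAndGet();
        return new ByteArrayInputStream(content);
    }

    int openCount() {
        return opens.get();
    }
}

final class StreamReuseSketch {
    // Before: the footer and the data are each read through their own stream.
    static int readWithSeparateOpens(CountingFileSystem fs) {
        try {
            try (InputStream footer = fs.open()) {
                footer.read();
            }
            try (InputStream data = fs.open()) {
                data.read();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fs.openCount();
    }

    // After: one stream is opened and handed to both the footer and data reads.
    static int readWithSharedStream(CountingFileSystem fs) {
        try (InputStream shared = fs.open()) {
            shared.mark(Integer.MAX_VALUE);
            shared.read();  // footer read
            shared.reset(); // reposition instead of reopening
            shared.read();  // data read
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fs.openCount();
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, 3, 4};
        System.out.println(readWithSeparateOpens(new CountingFileSystem(bytes)));
        System.out.println(readWithSharedStream(new CountingFileSystem(bytes)));
    }
}
```

In this toy model the shared-stream path performs one open instead of two; the real saving depends on how many times the engine opens the file per task.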

turboFei avatar Oct 10 '24 17:10 turboFei

This is from testing done 3 years ago on Spark 2.3.

[image] It reduces HDFS RPC requests to the NameNode by 2/3.

And after the community Spark patch "[SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader" (https://github.com/apache/spark/pull/39950), this solution might reduce HDFS RPC requests by 1/2.

turboFei avatar Oct 10 '24 17:10 turboFei

It looks reasonable to me and users can choose their best fit.

cc @gszadovszky @steveloughran

wgtmac avatar Oct 11 '24 16:10 wgtmac

@wgtmac @gszadovszky Could we merge this PR?

wangyum avatar Oct 31 '24 08:10 wangyum

Thank you all.

wangyum avatar Nov 01 '24 03:11 wangyum

For the record, I merged @Fokko's Parquet 1.15.0 PR into the Apache Spark repository.

To @turboFei and @wangyum , if you want, you can make a PR to use this new technique in Apache Spark as illustrated in the above comments.

Thank you!

dongjoon-hyun avatar Dec 03 '24 16:12 dongjoon-hyun