parquet-java

Add more constructors to ParquetFileReader

Open yuzhu opened this issue 10 months ago • 4 comments

Describe the enhancement requested

It seems that we are moving towards InputFile instead of HadoopConf and path (as indicated by the deprecation notices), but the InputFile-based constructors are missing some important variants, such as `public ParquetFileReader(InputFile file, ParquetMetadata footer, ParquetReadOptions options, SeekableInputStream f)`, where an external footer can be passed in.

If that seems OK, I would like to add some additional constructors to bring the InputFile-based ones to parity with the HadoopConf-based ones.

Component(s)

No response

yuzhu avatar Jan 30 '25 22:01 yuzhu

I think it makes sense to add this. Please feel free to create a PR.

wgtmac avatar Feb 03 '25 15:02 wgtmac

Can I suggest the builder pattern? It allows for more evolution over time and lets every option have a default value.
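
A minimal sketch of the idea, assuming nothing about the real reader: `ReaderConfig`, its fields, and the `withX` methods below are hypothetical names for illustration and are not part of the parquet-java API. It only shows how a builder gives every option a default and leaves room to add options later.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch: ReaderConfig and its options are illustrative names,
// not part of the parquet-java API.
final class ReaderConfig {
    private final String path;
    private final String externalFooter; // null means "read the footer from the file"
    private final boolean useStatsFilter;
    private final int maxAllocationSize;

    private ReaderConfig(Builder b) {
        this.path = b.path;
        this.externalFooter = b.externalFooter;
        this.useStatsFilter = b.useStatsFilter;
        this.maxAllocationSize = b.maxAllocationSize;
    }

    // The only required argument goes through the factory method.
    static Builder builder(String path) {
        return new Builder(path);
    }

    String path() { return path; }
    Optional<String> externalFooter() { return Optional.ofNullable(externalFooter); }
    boolean useStatsFilter() { return useStatsFilter; }
    int maxAllocationSize() { return maxAllocationSize; }

    static final class Builder {
        private final String path;
        // Every optional setting starts with a default (values chosen arbitrarily here).
        private String externalFooter = null;
        private boolean useStatsFilter = true;
        private int maxAllocationSize = 8 * 1024 * 1024;

        private Builder(String path) {
            this.path = Objects.requireNonNull(path, "path");
        }

        Builder withExternalFooter(String footer) {
            this.externalFooter = footer;
            return this;
        }

        Builder withUseStatsFilter(boolean enabled) {
            this.useStatsFilter = enabled;
            return this;
        }

        Builder withMaxAllocationSize(int bytes) {
            this.maxAllocationSize = bytes;
            return this;
        }

        ReaderConfig build() {
            return new ReaderConfig(this);
        }
    }
}
```

With this shape, a caller who only needs the defaults writes `ReaderConfig.builder("a.parquet").build()`, and a new option can later be added as one more `withX` method without touching existing call sites or adding another constructor overload.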

steveloughran avatar Mar 13 '25 20:03 steveloughran

@steveloughran I have the same feeling. For the reader we already have ParquetReadOptions. The writer does not even have this parity.

wgtmac avatar Mar 14 '25 01:03 wgtmac

On the topic of `ParquetReadOptions`: while investigating the performance of ParquetFileReader (I am using it to read a lot of small Parquet files), I noticed that ParquetReadOptions is actually very expensive to construct, due to its dependency on the Hadoop Configuration Parser. We added a clone method to ParquetReadOptions; let me know if you would find it useful to have that as well. Or maybe we need a default constructor that does not depend on Hadoop...
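
To illustrate the build-once, copy-per-file idea, here is a minimal sketch. `CostlyOptions` and `SmallFileScan` are hypothetical stand-ins, not parquet-java classes, and the counter merely simulates the expensive configuration parsing; this is not the clone method we added.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an options object that is expensive to build
// from configuration. Not a parquet-java class.
final class CostlyOptions {
    // Counts how often the (simulated) expensive construction path runs.
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    private final boolean useStatsFilter;
    private final int maxAllocationSize;

    private CostlyOptions(boolean useStatsFilter, int maxAllocationSize) {
        this.useStatsFilter = useStatsFilter;
        this.maxAllocationSize = maxAllocationSize;
    }

    // Simulates the slow path, e.g. parsing a configuration.
    static CostlyOptions fromConfiguration() {
        PARSE_COUNT.incrementAndGet();
        return new CostlyOptions(true, 8 * 1024 * 1024);
    }

    // Cheap copy: duplicates the fields without touching the slow path.
    CostlyOptions copy() {
        return new CostlyOptions(useStatsFilter, maxAllocationSize);
    }

    // Cheap per-file tweak, also without re-parsing anything.
    CostlyOptions withMaxAllocationSize(int bytes) {
        return new CostlyOptions(useStatsFilter, bytes);
    }

    boolean useStatsFilter() { return useStatsFilter; }
    int maxAllocationSize() { return maxAllocationSize; }
}

final class SmallFileScan {
    // Builds the options once and hands a cheap copy to each file.
    static int scan(List<String> paths) {
        CostlyOptions base = CostlyOptions.fromConfiguration();
        int opened = 0;
        for (String path : paths) {
            CostlyOptions perFile = base.copy();
            // A real reader would be opened here with `path` and `perFile`.
            opened++;
        }
        return opened;
    }
}
```

In this sketch the slow path runs once per scan regardless of how many files are listed, which is the saving I was after for workloads with many small files.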

yuzhu avatar Mar 18 '25 04:03 yuzhu