
Enable Parquet reader to use SeekableInputStream

leetcode-1533 opened this issue 2 years ago · 2 comments

Description

Enable the Parquet reader to "seek" in the input stream. This is part of https://github.com/trinodb/trino/issues/9471 and is needed when implementing a Bloom filter reader for Parquet, as specified in https://github.com/apache/parquet-mr/commit/806037c080dc477798d157cd4a54a81240a85d37#diff-227f2b038d111d090ba898611889833804545f5c1e5c2d88a43a98a05b2970eb.

The bloom_filter_offset field in the Thrift metadata specifies the "offset" of the Bloom filter header, but it does not specify the header's "length" (https://github.com/apache/parquet-format/blob/master/BloomFilter.md).

However, the Thrift-generated Java code can read in only the necessary amount of data when the stream starts at the "offset". In parquet-mr this is done at https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1303, which uses SeekableInputStream.
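As a rough illustration of that access pattern (not Trino's or parquet-mr's actual interfaces), a seekable input lets the caller position the stream at the header offset and then let the deserializer consume only as many bytes as it needs. The `SeekableInput`, `InMemorySeekableInput`, and `readIntAt` names below are made up for this sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Minimal stand-in for a seekable input stream; the real Trino/Parquet
// interfaces differ, this only illustrates the seek-then-read pattern.
interface SeekableInput
{
    void seek(long position) throws IOException;
    int read(byte[] buffer, int offset, int length) throws IOException;
}

final class InMemorySeekableInput implements SeekableInput
{
    private final byte[] data;
    private int position;

    InMemorySeekableInput(byte[] data) { this.data = data; }

    @Override
    public void seek(long position) { this.position = (int) position; }

    @Override
    public int read(byte[] buffer, int offset, int length)
    {
        int toRead = Math.min(length, data.length - position);
        System.arraycopy(data, position, buffer, offset, toRead);
        position += toRead;
        return toRead;
    }
}

public final class SeekExample
{
    // Seek to the given offset and let the deserializer (here just readInt)
    // consume only the bytes it needs; no length is required up front.
    static int readIntAt(SeekableInput input, long offset) throws IOException
    {
        input.seek(offset);
        byte[] buffer = new byte[4];
        input.read(buffer, 0, 4);
        return new DataInputStream(new ByteArrayInputStream(buffer)).readInt();
    }

    public static void main(String[] args) throws IOException
    {
        byte[] file = {0, 0, 0, 0, 0, 0, 0, 42}; // int 42 stored at offset 4
        SeekableInput input = new InMemorySeekableInput(file);
        System.out.println(readIntAt(input, 4)); // 42
    }
}
```

In the real Thrift-generated code, `readIntAt` corresponds to the generated `read` method deserializing the Bloom filter header directly from the positioned stream.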

I am also open to the suggested refactoring of removing org.apache.iceberg.io.SeekableInputStream from the TrinoInput interface.

I also removed HdfsParquetDataSource.java, which is currently used only by unit tests.

Without this change, I need to do multiple reads until the object is successfully deserialized, as in the Impala code, which uses a binary search: https://github.com/apache/impala/blob/522ee1fcc09d47074c75440ba4fc3d258e1c95b3/be/src/exec/parquet/hdfs-parquet-scanner.h#L894.
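A hedged sketch of that fallback: without seek support, the reader has to probe with progressively larger reads from the offset until deserialization succeeds. Impala's actual code uses a binary search; the doubling strategy and the `tryDeserialize` stand-in below are assumptions chosen for a compact illustration:

```java
import java.util.Arrays;
import java.util.Optional;

public final class ChunkedReadFallback
{
    // Stand-in for a Thrift deserializer: fails (returns empty) if the
    // buffer is too short to contain the full object. Purely illustrative.
    static Optional<String> tryDeserialize(byte[] buffer, int needed)
    {
        if (buffer.length < needed) {
            return Optional.empty();
        }
        return Optional.of(new String(buffer, 0, needed));
    }

    // Without seek support, read exponentially larger prefixes of the data
    // starting at the offset until deserialization succeeds. Each retry
    // re-fetches a larger range, wasting I/O compared to a single seek.
    static String readWithRetries(byte[] data, int offset, int objectSize)
    {
        int guess = 16;
        while (true) {
            byte[] chunk = Arrays.copyOfRange(data, offset, Math.min(offset + guess, data.length));
            Optional<String> result = tryDeserialize(chunk, objectSize);
            if (result.isPresent()) {
                return result.get();
            }
            guess *= 2;
        }
    }

    public static void main(String[] args)
    {
        byte[] data = "XXXXXXXXbloom filter header and payload bytes...".getBytes();
        // Object of 26 bytes at offset 8: first probe (16 bytes) fails,
        // second probe (32 bytes) succeeds.
        System.out.println(readWithRetries(data, 8, 26).length()); // 26
    }
}
```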

Non-technical explanation

Enable the Parquet reader to "seek" in the input stream.

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

leetcode-1533 · Sep 19 '22 23:09

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: yluan. This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: `git config --list | grep email`
  2. If not, set it up using `git config --global user.email [email protected]`
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

cla-bot[bot] · Sep 19 '22 23:09

The bloom_filter_offset in thrift specified the "offset" of the bloomfilter header, but it does not specify the "length" of the bloomfilter header.

Since you don't know the length, how would you control the amount of data fetched from the input stream? Is it arbitrary?

Does the bloom filter have a predictable size that we can use? Can you read in chunks (starting from the given offset) until you see the end of the filter?

sopel39 · Sep 21 '22 14:09

The bloom_filter_offset in thrift specified the "offset" of the bloomfilter header, but it does not specify the "length" of the bloomfilter header.

Since you don't know the length, how would you control the amount of data fetched from the input stream? Is it arbitrary?

Does the bloom filter have a predictable size that we can use? Can you read in chunks (starting from the given offset) until you see the end of the filter?

The spec (https://github.com/apache/parquet-format/blob/master/BloomFilter.md) does not specify the length of the read needed to get the bloom filter of each column; it gives only the start offset. The bloom filters may be stored either after all the row groups and before the start of the footer and page indexes, or before the start of every row group. We can derive a length for the read based on whether the bloom filter offset is less than the start of the first row group or greater than the end of the last row group. In the former case we can read up to the start of the selected row groups; in the latter we can read up to the start of the page indexes or the footer (whichever comes first). So this is doable without resorting to streaming read APIs.
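The length derivation described above could be sketched as follows. `BloomFilterReadPlanner` and its parameter names are hypothetical, not Trino APIs; the file-layout offsets would come from the Parquet footer in practice:

```java
// Hypothetical helper sketching the read-length bound described above.
public final class BloomFilterReadPlanner
{
    private BloomFilterReadPlanner() {}

    /**
     * Derive an upper bound on how many bytes to fetch for a bloom filter,
     * given only its start offset and the file's layout landmarks.
     */
    public static long readLength(
            long bloomFilterOffset,
            long firstRowGroupOffset,
            long lastRowGroupEnd,
            long pageIndexOrFooterOffset)
    {
        if (bloomFilterOffset < firstRowGroupOffset) {
            // Bloom filters stored before the row groups:
            // read at most up to the start of the first row group.
            return firstRowGroupOffset - bloomFilterOffset;
        }
        if (bloomFilterOffset >= lastRowGroupEnd) {
            // Bloom filters stored after all row groups:
            // read at most up to the page index / footer, whichever comes first.
            return pageIndexOrFooterOffset - bloomFilterOffset;
        }
        throw new IllegalArgumentException("bloom filter offset falls inside row group data");
    }

    public static void main(String[] args)
    {
        // Filters before the row groups: bound is the gap to the first row group.
        System.out.println(readLength(100, 1_000, 50_000, 60_000)); // 900
        // Filters after the row groups: bound is the gap to the footer.
        System.out.println(readLength(52_000, 1_000, 50_000, 60_000)); // 8000
    }
}
```

The bound is conservative: the actual bloom filter may be smaller, but the read can never run past it into unrelated data.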

raunaqmorarka · Sep 27 '22 10:09

@raunaqmorarka is there a performance penalty when using streaming read APIs?

leetcode-1533 · Sep 27 '22 19:09

@raunaqmorarka is there a performance penalty when using streaming read APIs?

There is a description of the problems encountered with streaming reads in https://trino.io/blog/2019/05/06/faster-s3-reads.html

raunaqmorarka · Oct 03 '22 07:10