
GH-3080: HadoopStreams to support ByteBufferPositionedReadable

Open steveloughran opened this issue 1 year ago • 2 comments

Rationale for this change

If a stream declares in its StreamCapabilities that it supports ByteBufferPositionedReadable, then use that API for readFully(ByteBuffer):

ByteBufferPositionedReadable.readFully(long position, ByteBuffer buf)

Adding support for Hadoop ByteBufferPositionedReadable streams may improve performance by pushing retry/recovery logic into the filesystem client library.

This interface is implemented by the HDFS input stream; we are considering adding it elsewhere.
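
A minimal, self-contained sketch of the idea, not the code in this PR: a sequential readFully(ByteBuffer) built on a positioned read. `PositionedByteBufferRead` and `InMemoryStream` are hypothetical stand-ins so the example runs without Hadoop on the classpath. The sketch assumes the positioned read leaves the stream offset unchanged, so the caller advances it afterwards.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical stand-in for the one method of Hadoop's
// ByteBufferPositionedReadable that this sketch relies on.
interface PositionedByteBufferRead {
  void readFully(long position, ByteBuffer buf) throws IOException;
}

// Hypothetical in-memory stream, present only to make the sketch runnable.
final class InMemoryStream implements PositionedByteBufferRead {
  private final byte[] data;
  private long pos;

  InMemoryStream(byte[] data) {
    this.data = data;
  }

  long getPos() {
    return pos;
  }

  void seek(long newPos) {
    pos = newPos;
  }

  // Fills buf from the given offset without touching the stream position.
  @Override
  public void readFully(long position, ByteBuffer buf) throws IOException {
    int len = buf.remaining();
    if (position < 0 || position + len > data.length) {
      throw new EOFException("read of " + len + " bytes at " + position + " passes end of data");
    }
    buf.put(data, (int) position, len);
  }
}

final class PositionedReadSketch {
  // Sequential readFully(ByteBuffer): read at the current offset,
  // then advance the offset by the number of bytes consumed.
  static void readFully(InMemoryStream in, ByteBuffer buf) throws IOException {
    int len = buf.remaining();
    long start = in.getPos();
    in.readFully(start, buf);
    in.seek(start + len);
  }

  public static void main(String[] args) throws IOException {
    InMemoryStream in = new InMemoryStream(new byte[] {1, 2, 3, 4, 5, 6});
    in.seek(2);
    ByteBuffer buf = ByteBuffer.allocateDirect(3);
    readFully(in, buf);
    buf.flip();
    System.out.println(buf.get(0) + " " + buf.get(2) + " pos=" + in.getPos());
  }
}
```

With a real filesystem client, any retry or recovery would happen inside the single positioned call rather than in a read loop on the Parquet side.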

What changes are included in this PR?

  • New SeekableInputStream implementation: H3ByteBufferInputStream
  • Instantiated in HadoopStreams if the FSDataInputStream is considered suitable.
  • Tests for the new behavior and to verify that existing streams do not regress.

Class H3ByteBufferInputStream

The reading is done in a new class, H3ByteBufferInputStream, which subclasses H2ByteBufferInputStream. This reduces the amount of duplicate code, though it makes the class hierarchy a bit unclean.

The purist way to do it would be to create an abstract superclass, HadoopInputStream, to hold everything common to the three input streams.

I'm happy to do this; I just didn't want to do a larger refactoring without (a) showing the core design worked and (b) getting permission to do so. Should I do this?

HadoopStreams changes

The new input stream is selected if and only if the stream declares the capability in:preadbytebuffer. There is no equivalent of isWrappedStreamByteBufferReadable(), which recurses through a chain of wrapped streams looking for the API: if a stream doesn't declare its support for the API, it won't get picked up. This is done knowing that the sole production implementation which currently exists, the HDFS input stream, does declare this capability.
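
The selection rule above could be sketched roughly as follows. This is an illustration only, not the HadoopStreams code: `CapabilityProbe`, `StreamSelectionSketch` and the `Reader` enum are hypothetical names, and only the capability string comes from the description.

```java
// Hypothetical stand-in for the hasCapability() probe of Hadoop's StreamCapabilities.
interface CapabilityProbe {
  boolean hasCapability(String capability);
}

final class StreamSelectionSketch {
  // Capability string named in the description above.
  static final String PREADBYTEBUFFER = "in:preadbytebuffer";

  // Hypothetical labels for "use the new stream" and "keep the existing choice".
  enum Reader { POSITIONED_BYTE_BUFFER, EXISTING }

  // Only the outermost stream is asked; wrapped streams are never unwrapped.
  static Reader select(CapabilityProbe stream) {
    return stream.hasCapability(PREADBYTEBUFFER)
        ? Reader.POSITIONED_BYTE_BUFFER
        : Reader.EXISTING;
  }

  public static void main(String[] args) {
    CapabilityProbe declaring = capability -> PREADBYTEBUFFER.equals(capability);
    // A wrapper that declares nothing, even if the stream inside it supports the API.
    CapabilityProbe silentWrapper = capability -> false;
    System.out.println(select(declaring));
    System.out.println(select(silentWrapper));
  }
}
```

The second case is the trade-off described above: a wrapper that does not forward the capability falls back to the existing stream selection, even when the stream it wraps supports the API.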

Are these changes tested?

There is a new test suite covering the new behavior and verifying that the integration with HadoopStreams still retains the correct behavior for existing streams. The suite is parameterized on heap and direct buffers.
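
As a rough illustration of what parameterizing on heap and direct buffers means, the sketch below runs the same check against both buffer kinds. It uses a plain loop rather than a test framework, and `BufferKindsSketch`, `positionedRead` and `roundTrips` are hypothetical names, not part of the suite in this PR.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.function.IntFunction;

final class BufferKindsSketch {
  // Stand-in for the read under test: fills buf from data starting at position.
  static void positionedRead(byte[] data, int position, ByteBuffer buf) {
    buf.put(data, position, buf.remaining());
  }

  // Runs one read with a buffer from the given allocator and checks the bytes.
  static boolean roundTrips(IntFunction<ByteBuffer> allocator) {
    byte[] data = {10, 20, 30, 40, 50};
    ByteBuffer buf = allocator.apply(3);
    positionedRead(data, 1, buf);
    buf.flip();
    return buf.remaining() == 3
        && buf.get(0) == 20
        && buf.get(1) == 30
        && buf.get(2) == 40;
  }

  public static void main(String[] args) {
    IntFunction<ByteBuffer> heap = ByteBuffer::allocate;
    IntFunction<ByteBuffer> direct = ByteBuffer::allocateDirect;
    // The same check runs once per buffer kind.
    for (IntFunction<ByteBuffer> allocator : List.of(heap, direct)) {
      if (!roundTrips(allocator)) {
        throw new AssertionError("read produced unexpected bytes");
      }
    }
    System.out.println("heap and direct buffers both pass");
  }
}
```

Covering both kinds matters because a heap buffer is backed by an accessible byte array while a direct buffer is not, so a read path can work for one and fail for the other.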

Are there any user-facing changes?

No

Closes GH-3080

steveloughran avatar Dec 03 '24 20:12 steveloughran

I'm away until 2025; will reply to comments then. Thanks for the review.

steveloughran avatar Dec 20 '24 20:12 steveloughran

I'm back, don't think I've forgotten this. In fact I've been actually setting up a test-only-loop for hadoop for regression testing parquet support through the cloud connectors. https://github.com/apache/hadoop/pull/7285

steveloughran avatar Jan 14 '25 19:01 steveloughran

ah, did neglect this didn't I? will revisit.

steveloughran avatar Sep 22 '25 10:09 steveloughran

@steveloughran Is this something we want to get in 1.17.0? See https://lists.apache.org/thread/g1cngnkzhjt86yt4dfl078yrplfmzcf5

Fokko avatar Dec 02 '25 19:12 Fokko