Support projections in ParquetAvroFileOperations/ParquetAvroSortedBucketIO

Open clairemcginty opened this issue 7 months ago • 1 comments

ParquetAvroFileOperations always overrides the "projection" option to equal the full reflected schema, so you can't supply a projection for a SpecificRecord class:

https://github.com/spotify/scio/blob/110f79593c67c58a2c2465bf2fb340ff4711003f/scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/ParquetAvroFileOperations.java#L175-L176

clairemcginty avatar Nov 17 '23 19:11 clairemcginty

#5083 provides a workaround for this via the Configuration parameter:

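import org.apache.avro.Schema
import org.apache.parquet.avro.AvroReadSupport
import com.spotify.scio.parquet.ParquetConfiguration
import org.apache.beam.sdk.extensions.smb.ParquetAvroSortedBucketIO

// Projection schema containing only the fields to read
val projection: Schema = ...
val configuration = ParquetConfiguration.empty()
// Store the projection in the Configuration that is passed to the reader
AvroReadSupport.setRequestedProjection(configuration, projection)

val read = ParquetAvroSortedBucketIO
  .read(tupleTag, classOf[TestRecord])
  .from(...)
  .withConfiguration(configuration)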
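
For illustration only, here is one way a projection schema like the one left elided above might be built with Avro's SchemaBuilder. The namespace and field names (com.example.avro, int_field, string_field) are hypothetical, as is the ProjectionExample wrapper object. In practice the record name, namespace, and field types would presumably need to match the generated TestRecord class and the schema written to the files.

```scala
import org.apache.avro.{Schema, SchemaBuilder}

object ProjectionExample {
  // Hypothetical projection that keeps only two fields of the full record.
  // All names below are assumptions made for this sketch, not taken from
  // the actual TestRecord schema.
  val projection: Schema = SchemaBuilder
    .record("TestRecord")
    .namespace("com.example.avro")
    .fields()
    .optionalInt("int_field")       // union of null and int, default null
    .optionalString("string_field") // union of null and string, default null
    .endRecord()
}
```

A schema built this way could then be passed to AvroReadSupport.setRequestedProjection as in the workaround above.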

In 0.14, we can add projection as a Builder method on ParquetAvroSortedBucketIO.

clairemcginty avatar Nov 20 '23 14:11 clairemcginty