scio
A Scala API for Apache Beam and Google Cloud Dataflow.
## About this PR

📦 Updates [com.softwaremill.magnolia1_2:magnolia](https://github.com/softwaremill/magnolia) from `1.1.8` to `1.1.9`

## Usage

✅ **Please merge!** I'll automatically update this PR to resolve conflicts as long as you don't change...
WIP of Parquet projection/predicate support in JobTest. Alternatively, we could provide custom assertions similar to `CoderAssertions`, e.g.:

```scala
val record: AvroType = ???
record withPredicate(FilterApi.and(...)) withProjection(...schema...) should eq(...)
```

but...
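For context, here is a minimal sketch of the kind of `FilterPredicate` such assertions would exercise, built with Parquet's `FilterApi`; the column name and bounds are hypothetical:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Hypothetical predicate: 18 <= age <= 65.
val predicate: FilterPredicate = FilterApi.and(
  FilterApi.gtEq(FilterApi.intColumn("age"), Int.box(18)),
  FilterApi.ltEq(FilterApi.intColumn("age"), Int.box(65))
)
```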
A pipeline on Scio 0.14.3 migrated from applying Beam's SortValues transform directly:

```scala
data
  .map { element => KV.of(..., KV.of(..., ...)) }
  .applyTransform(GroupByKey.create())
  .setCoder(...)
  .applyTransform(
    SortValues.create(
      BufferedExternalSorter
        .options()
        .withMemoryMB(2047)
        .withExternalSorterType(ExternalSorter.Options.SorterType.NATIVE)...
```
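For reference, the sorter configuration from the truncated snippet above as a self-contained unit; the imports are the Beam sorter extension's packages, and the 2047 MB value is simply carried over from the snippet:

```scala
import org.apache.beam.sdk.extensions.sorter.{BufferedExternalSorter, ExternalSorter}

// Sorter options as used above: buffer up to 2047 MB in memory,
// spilling to the native external sorter beyond that.
val sorterOptions: BufferedExternalSorter.Options =
  BufferedExternalSorter
    .options()
    .withMemoryMB(2047)
    .withExternalSorterType(ExternalSorter.Options.SorterType.NATIVE)
```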
Adds

* `saveAsZstdDictionary` to train a Zstd dictionary on some arbitrary `SCollection[T]`. Estimates the average size of elements `T`, collects `n` elements based on a target training set size, then...
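A rough sketch of the dictionary-training step this would perform, using zstd-jni's `ZstdDictTrainer` directly; this is not Scio's implementation, and the sample collection and size parameters are assumptions:

```scala
import com.github.luben.zstd.ZstdDictTrainer

// Train a Zstd dictionary from locally collected, serialized elements.
// `samples`, `trainingSetSize` and `dictSize` are hypothetical inputs.
def trainDictionary(
  samples: Seq[Array[Byte]],
  trainingSetSize: Int,
  dictSize: Int
): Array[Byte] = {
  val trainer = new ZstdDictTrainer(trainingSetSize, dictSize)
  samples.foreach(trainer.addSample)
  trainer.trainSamples()
}
```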
Given the following schema:

```json
{
  "type": "record",
  "name": "TestRecord",
  "fields": [
    {
      "name": "nullableDecimal",
      "type": [
        "null",
        { "type": "bytes", "logicalType": "decimal", "precision": 4, "scale": 2 }
      ]
    }...
```
It would be nice if there were an easy API for users to test Parquet FilterPredicates against SCollections and assert on expected input/output. Would probably require materializing the data to...
When using `saveAsSparkey`, if any shard is > ~2 GB then you will get a coder exception and something like:

```
Error message from worker: org.apache.beam.sdk.util.UserCodeException: java.lang.OutOfMemoryError: Required array length 2147483639...
```
parquet-avro supports Schema projections that exclude required fields. However, if a required field is excluded, the Avro record will fail the Coder roundtrip during the next PTransform. As a workaround in...
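For illustration, a hedged sketch of the situation described, using Avro's `SchemaBuilder`; the record and field names are hypothetical:

```scala
import org.apache.avro.{Schema, SchemaBuilder}

// Full write schema: both fields are required.
val fullSchema: Schema = SchemaBuilder
  .record("User").namespace("com.example")
  .fields()
  .requiredString("id")
  .requiredString("email")
  .endRecord()

// Projection that excludes the required "email" field: parquet-avro reads it
// fine, but the resulting record leaves the field unset and so fails the Avro
// Coder roundtrip in the next PTransform.
val projection: Schema = SchemaBuilder
  .record("User").namespace("com.example")
  .fields()
  .requiredString("id")
  .endRecord()
```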
Add an option, either a pipeline option or an API method parameter, to instruct Scio to fall back to a non-SMB join if it can't perform an SMB join (due to incompatible...
Use the correct Avro writer constructor that loads the generated model with its conversions. Set up CI to test scio-avro with the latest version on main. Fix #5331
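As a hedged illustration of the distinction (not necessarily the exact change here): constructing a `SpecificDatumWriter` from a bare `Schema` uses a default `SpecificData` with no logical-type conversions registered, whereas building it from the model attached to the generated class picks up its conversions:

```scala
import org.apache.avro.specific.{SpecificData, SpecificDatumWriter, SpecificRecordBase}

// Build a writer from the generated class's own data model (its MODEL$),
// so conversions such as decimal are applied when writing.
def writerWithConversions[T <: SpecificRecordBase](cls: Class[T]): SpecificDatumWriter[T] = {
  val model: SpecificData = SpecificData.getForClass(cls)
  new SpecificDatumWriter[T](model.getSchema(cls), model)
}
```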