scio
scio copied to clipboard
A Scala API for Apache Beam and Google Cloud Dataflow.
While upgrading scio from 0.12 to latest 0.13.3 and Apache Beam from 2.41 to 2.50 I observed an issue that Beam's `GroupByKey` is not working anymore in tests as it...
When using the new Parquet SplittableDoFn implementation to read a large # of files, the [file metadata lookup](https://github.com/spotify/scio/blob/main/scio-parquet/src/main/scala/com/spotify/scio/parquet/read/ParquetReadFn.scala#L268) (required to break down individual files into parallelizable row groups) can be...
Fix #4955
We have implemented a custom JDBC parallel read API because at the time, beam did not support that. Now beam has [readWitPartition](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/jdbc/JdbcIO.html#readWithPartitions-org.apache.beam.sdk.values.TypeDescriptor-) supporting `Long` and `DateTime`. We should probably considering...
When users experience "job graph too large" issues, the problem is typically implicitly-derived coders or parquet config. We could potentially capture this exception and analyze the graph to automatically suggest...
I tried to use suffix to read only parquet files using `parquetAvroFile` method. I tried with both `suffix = ".parquet"` and `suffix = "parquet"`, none seemed to filter out a...
`sinkViaGenericRecords` and `RecordFormatter` are both [deprecated](https://github.com/apache/beam/pull/8418); instead, we should be adding an additional map step to explicitly convert from `ElementT => AvroT`. This affects some of our dynamic IO utilities.
Currently there's two implementations in use: https://github.com/spotify/scio/blob/e5384c7ce900027edf8b0b722a366fb77c4bb7b9/scio-tensorflow/src/main/scala/com/spotify/scio/tensorflow/TFRecordCodec.scala https://github.com/spotify/scio/blob/4201013fe24407d8fb3f0b0c480cdee58c0664f0/scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/TFRecordCodec.java
Right now they only cast underlying `SCollection`, which might not work for mixing `SCollection`s of ADTs. https://github.com/spotify/scio/blob/main/scio-core/src/main/scala/com/spotify/scio/values/SCollection.scala#L294 https://spotify-foss.slack.com/archives/C015189TFDH/p1620769273059800
e.g. unwindowed -> typical result from DefaultFilenamePolicy, windowed -> daily GCS dirs