scio icon indicating copy to clipboard operation
scio copied to clipboard

A Scala API for Apache Beam and Google Cloud Dataflow.

Results 213 scio issues
Sort by recently updated
recently updated
newest added

While upgrading scio from 0.12 to latest 0.13.3 and Apache Beam from 2.41 to 2.50 I observed an issue that Beam's `GroupByKey` is not working anymore in tests as it...

bug

When using the new Parquet SplittableDoFn implementation to read a large # of files, the [file metadata lookup](https://github.com/spotify/scio/blob/main/scio-parquet/src/main/scala/com/spotify/scio/parquet/read/ParquetReadFn.scala#L268) (required to break down individual files into parallelizable row groups) can be...

enhancement
parquet

We have implemented a custom JDBC parallel read API because at the time, beam did not support that. Now beam has [readWitPartition](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/jdbc/JdbcIO.html#readWithPartitions-org.apache.beam.sdk.values.TypeDescriptor-) supporting `Long` and `DateTime`. We should probably considering...

enhancement
io

When users experience "job graph too large" issues, the problem is typically implicitly-derived coders or parquet config. We could potentially capture this exception and analyze the graph to automatically suggest...

enhancement

I tried to use suffix to read only parquet files using `parquetAvroFile` method. I tried with both `suffix = ".parquet"` and `suffix = "parquet"`, none seemed to filter out a...

io

`sinkViaGenericRecords` and `RecordFormatter` are both [deprecated](https://github.com/apache/beam/pull/8418); instead, we should be adding an additional map step to explicitly convert from `ElementT => AvroT`. This affects some of our dynamic IO utilities.

blocked
deprecation

Currently there's two implementations in use: https://github.com/spotify/scio/blob/e5384c7ce900027edf8b0b722a366fb77c4bb7b9/scio-tensorflow/src/main/scala/com/spotify/scio/tensorflow/TFRecordCodec.scala https://github.com/spotify/scio/blob/4201013fe24407d8fb3f0b0c480cdee58c0664f0/scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/TFRecordCodec.java

enhancement
SMB

Right now they only cast underlying `SCollection`, which might not work for mixing `SCollection`s of ADTs. https://github.com/spotify/scio/blob/main/scio-core/src/main/scala/com/spotify/scio/values/SCollection.scala#L294 https://spotify-foss.slack.com/archives/C015189TFDH/p1620769273059800

bug
good first issue

e.g. unwindowed -> typical result from DefaultFilenamePolicy, windowed -> daily GCS dirs

enhancement