Neville Li

45 comments by Neville Li

More specifically, one can write nested case classes as Parquet files with `parquet-types` and might want to read them as TF `Example`, which doesn't support nesting. Alternatively we can ask...

@raunaqmorarka CLA approved and everything passed except one, possibly due to a network issue?

@regadas WDYT? @mdvorsky if you think it's a trivial change, mind submitting a PR and we can discuss there?

WIP here: https://github.com/spotify/scio/tree/neville/proto-smb Not sure if we really need this, given that `.protobuf.avro` is not a standard format and we worked around it internally. Will leave this on hold for now.

This is sort of by design. `sc.ParquetAvroFile(path)` doesn't initiate the IO right away; `.map(f)` does, together with `f` as a projection function, since the projected Avro records might be incomplete...

Agree that a new `STransform` might be too complex. The goal of this is to allow power users to dynamically produce a job graph for certain tricky transforms like join based...

Beam SQL's `BeamSqlTable` interface actually has a notion of table statistics, so maybe we can leverage that. Then again, graph optimizations are probably easier in SQL than in Scala code. https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/BeamTableStatistics.java

After discussion IRL, this should be handled in Beam SQL. Adding it to the Scala layer would introduce too much complexity. Will leave this open as a reminder.

https://github.com/spotify/scio/blob/master/scio-extra/src/main/scala/com/spotify/scio/extra/csv/CsvIO.scala#L202 Should be possible since we have a custom read `DoFn` instead of the generic line-delimited `TextIO`. I suspect you'll have to add some implicit arguments to propagate the...

This is a streaming job, I assume? By timeout, do you mean overwriting the same file URIs and having `DistCache` instances re-download them? This might lead to race conditions and...