
Unbounded file sink

Open dhalperi opened this issue 10 years ago • 0 comments

A common request is the ability, in streaming pipelines, to write data to one file per window, similar to the per-window output that already exists for BigQueryIO.

We should add a ParDo-based example for this, and we also need to supply a custom file sink that works in streaming mode (TextIO does not).

A few of the subtleties that come from streaming operation:

  • For fault tolerance, each bundle is a separate unit of isolation. Unlike in batch, we can't just retry the entire step if any bundle fails, so we can't simply append to the same destination file from different bundles.

    We need to create a separate temporary file for each bundle, then mark it as permanent/successful in an idempotent way in a subsequent step.

  • If we have many elements, many workers, or many windows concurrently open, this could end up creating lots of bundles.

    Using GroupByKey and large windows will give us fewer bundles, but more elements per key-window group. The design needs to scale to large values.
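
To make the temporary-file idea concrete, here is a minimal sketch using only `java.nio` on a local filesystem. It is not Dataflow SDK code: the class `PerBundleFileSink`, its methods, and the `windowLabel` and `bundleId` parameters are all hypothetical names for illustration. It assumes `bundleId` is stable across retries of the same logical bundle (for example, a shard key assigned before a GroupByKey), which is the part a real design would still have to solve.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.UUID;

/** Hypothetical helper for illustration; not part of the Dataflow SDK. */
public class PerBundleFileSink {

    /** Step 1: write one bundle's elements to its own uniquely named temp file. */
    public static Path writeBundle(Path tempDir, List<String> elements) {
        try {
            Files.createDirectories(tempDir);
            // A fresh random name per attempt, so a retried bundle never
            // appends to or overwrites a file from another attempt.
            Path temp = tempDir.resolve("bundle-" + UUID.randomUUID() + ".tmp");
            Files.write(temp, elements, StandardCharsets.UTF_8);
            return temp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Step 2: promote a temp file to a deterministic final name.
     * Assumes bundleId is the same for every retry of the same bundle.
     * Calling this again for the same (windowLabel, bundleId) is a no-op
     * apart from cleaning up the leftover temp file.
     */
    public static Path finalizeBundle(Path temp, Path outputDir,
                                      String windowLabel, String bundleId) {
        Path dest = outputDir.resolve(windowLabel + "-" + bundleId + ".txt");
        try {
            Files.createDirectories(outputDir);
            if (!Files.exists(dest)) {
                try {
                    // Without REPLACE_EXISTING, move fails if dest exists.
                    Files.move(temp, dest);
                } catch (FileAlreadyExistsException raced) {
                    // Another attempt finalized first; keep its output.
                }
            }
            // Remove this attempt's temp file if it was not the one promoted.
            Files.deleteIfExists(temp);
            return dest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, two attempts at the same bundle produce two temp files but only one final file, and repeating the finalize step changes nothing. The local rename here stands in for whatever commit operation the target filesystem offers; its atomicity guarantees would need to be checked separately.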

dhalperi · Nov 11 '15 05:11