dataflow-java

Update code to use Dataflow's new support for custom sources

Open deflaux opened this issue 10 years ago • 1 comment

We manually create data shards right now via --references (or --allReferences) and --basesPerShard from ShardOptions.

Updating to custom sources will allow the data shards to be not only created but also re-sharded dynamically.

https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/Source

https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/Source.java
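To make the difference concrete, here is a minimal sketch of the two operations involved: the fixed up-front split that --basesPerShard gives us today, and the further split of an existing shard that dynamic re-sharding would need. This is plain Java and does not use the SDK's Source class; `RangeShard`, `split`, and `halve` are hypothetical names chosen for illustration, and a real custom source would have to follow whatever contract the SDK defines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the Dataflow SDK Source API.
final class RangeShard {
  final String reference; // e.g. a reference (contig) name
  final long start;       // inclusive
  final long end;         // exclusive

  RangeShard(String reference, long start, long end) {
    if (start < 0 || end < start) {
      throw new IllegalArgumentException("invalid range: " + start + "-" + end);
    }
    this.reference = reference;
    this.start = start;
    this.end = end;
  }

  long size() {
    return end - start;
  }

  // Static sharding: cut [start, end) into pieces of at most basesPerShard bases.
  static List<RangeShard> split(String reference, long start, long end, long basesPerShard) {
    if (basesPerShard <= 0) {
      throw new IllegalArgumentException("basesPerShard must be positive");
    }
    List<RangeShard> shards = new ArrayList<>();
    long s = start;
    while (s < end) {
      long length = Math.min(basesPerShard, end - s);
      shards.add(new RangeShard(reference, s, s + length));
      s += length;
    }
    return shards;
  }

  // Re-sharding: cut this shard at its midpoint so the halves can be processed
  // independently. A shard of fewer than two bases is returned unchanged.
  List<RangeShard> halve() {
    List<RangeShard> halves = new ArrayList<>();
    if (size() < 2) {
      halves.add(this);
      return halves;
    }
    long mid = start + size() / 2;
    halves.add(new RangeShard(reference, start, mid));
    halves.add(new RangeShard(reference, mid, end));
    return halves;
  }
}
```

For example, `RangeShard.split("chr17", 0, 2500, 1000)` yields three shards (0-1000, 1000-2000, 2000-2500), and calling `halve()` on the first gives 0-500 and 500-1000. With a custom source, the second kind of split is what the service could ask for while the job is running, instead of us fixing the shard size up front.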

deflaux, May 04 '15 18:05

That would be awesome! Yeah, I mentioned something along these lines a while ago here, and I'm very excited to see it:

https://github.com/googlegenomics/spark-examples/pull/49#issuecomment-61376803

Maybe a function could take the size of the requested region and use it to compute --basesPerShard dynamically.
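Something along these lines, as a rough sketch only. The class name, the method, and the idea of aiming for a target number of shards clamped between a minimum and maximum size are all assumptions of mine for illustration, not anything that exists in ShardOptions today:

```java
// Hypothetical helper; names and the sizing policy are assumptions for illustration.
final class ShardSizing {
  private ShardSizing() {}

  // Pick a shard size so the region splits into roughly targetShards pieces,
  // but never smaller than minBases or larger than maxBases.
  static long basesPerShard(long regionSize, int targetShards, long minBases, long maxBases) {
    if (regionSize < 0 || targetShards <= 0 || minBases <= 0 || maxBases < minBases) {
      throw new IllegalArgumentException("invalid sizing parameters");
    }
    // Ceiling division, written to avoid overflow for large regions.
    long ideal = regionSize / targetShards + (regionSize % targetShards == 0 ? 0 : 1);
    return Math.max(minBases, Math.min(maxBases, ideal));
  }
}
```

With a target of 10 shards and bounds of 10,000 to 500,000 bases, a 1,000,000-base region gets 100,000 bases per shard, a 5,000-base region is clamped up to 10,000, and a 100,000,000-base region is clamped down to 500,000.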

Thanks and very excited to see the results! ~p

pgrosu, May 04 '15 19:05