dataflow-java
Update code to use Dataflow's new support for custom sources
We manually create data shards right now via --references (or --allReferences) and --basesPerShard from ShardOptions.
Updating to custom sources will allow the data shards to be not only created but also re-sharded dynamically.
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/Source
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/Source.java
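To make the contrast concrete, here is a minimal, self-contained sketch of the two ideas: cutting a region into fixed-size shards up front (roughly what `--basesPerShard` implies), and splitting an existing shard later. It does not use the Dataflow SDK. The class and method names (`StaticShardSketch`, `shardByBases`, `splitInHalf`) and the example coordinates are hypothetical, and splitting at the midpoint is only one possible re-sharding strategy, not necessarily what a custom source would do.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of dataflow-java or the Dataflow SDK. */
final class StaticShardSketch {

  /** Half-open range [start, end) on a single reference. */
  static final class Shard {
    final String reference;
    final long start;
    final long end;

    Shard(String reference, long start, long end) {
      this.reference = reference;
      this.start = start;
      this.end = end;
    }

    long size() {
      return end - start;
    }

    @Override
    public String toString() {
      return reference + ":" + start + "-" + end;
    }
  }

  /** Static sharding: cut [start, end) into pieces of at most basesPerShard bases. */
  static List<Shard> shardByBases(String reference, long start, long end, long basesPerShard) {
    if (basesPerShard <= 0) {
      throw new IllegalArgumentException("basesPerShard must be positive");
    }
    List<Shard> shards = new ArrayList<>();
    for (long s = start; s < end; s += basesPerShard) {
      shards.add(new Shard(reference, s, Math.min(s + basesPerShard, end)));
    }
    return shards;
  }

  /** Re-sharding (assumed strategy): split one shard at its midpoint. */
  static List<Shard> splitInHalf(Shard shard) {
    List<Shard> out = new ArrayList<>();
    if (shard.size() < 2) {
      out.add(shard); // too small to split further
      return out;
    }
    long mid = shard.start + shard.size() / 2;
    out.add(new Shard(shard.reference, shard.start, mid));
    out.add(new Shard(shard.reference, mid, shard.end));
    return out;
  }

  public static void main(String[] args) {
    // Example coordinates only; any reference name and range would do.
    List<Shard> shards = shardByBases("chr17", 41196311L, 41277499L, 30000L);
    System.out.println(shards);
    System.out.println(splitInHalf(shards.get(0)));
  }
}
```

With static sharding the shard boundaries are fixed before the pipeline starts; the second method hints at what becomes possible when shards can be subdivided while the job is running.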
That would be awesome! Yeah, I mentioned something along these lines a while ago here, and I'm very excited to see it:
https://github.com/googlegenomics/spark-examples/pull/49#issuecomment-61376803
Maybe a function could take the size of the requested region and dynamically compute the value for --basesPerShard.
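As a rough sketch of that idea, something like the following could derive the shard size from the region size. Everything here is an assumption for illustration: the class `BasesPerShardSketch`, the `targetShards` parameter, and the min/max clamps are hypothetical and do not exist in ShardOptions.

```java
/** Hypothetical sketch; these names and parameters are not part of dataflow-java. */
final class BasesPerShardSketch {

  /**
   * Picks a shard size so the region is cut into roughly targetShards pieces,
   * clamped to [minBases, maxBases] to avoid tiny or enormous shards.
   */
  static long basesPerShard(long regionSize, int targetShards, long minBases, long maxBases) {
    if (regionSize <= 0 || targetShards <= 0) {
      throw new IllegalArgumentException("regionSize and targetShards must be positive");
    }
    if (minBases <= 0 || minBases > maxBases) {
      throw new IllegalArgumentException("need 0 < minBases <= maxBases");
    }
    // Ceiling division, so targetShards shards of this size cover the whole region.
    long perShard = (regionSize + targetShards - 1) / targetShards;
    return Math.max(minBases, Math.min(maxBases, perShard));
  }

  public static void main(String[] args) {
    // Example values only: a region of 81,188 bases, aiming for about 8 shards.
    System.out.println(basesPerShard(81188L, 8, 1000L, 1000000L));
  }
}
```

The clamps are a design choice: without a lower bound, small regions would produce many near-empty shards, and without an upper bound, whole-genome requests would produce shards too large to balance well.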
Thanks and very excited to see the results! ~p