BatchDoFn and scio batch API on SCollection

Open RustedBones opened this issue 1 year ago • 1 comments

Amortize processing cost by batching elements locally. Batching respects windowing.

This aims to provide an API symmetric with the KV batching in #4458. Because the batch is emitted on finishBundle, no maxBufferingDuration is required.
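
The behaviour described above can be sketched in plain Scala, without any Beam or scio dependency. This is only an illustrative model, not the PR's BatchDoFn: the class name `BundleBatcher`, its `emit` callback, and the method names are hypothetical, chosen to mirror the bundle lifecycle. It keeps one buffer per window, emits a batch as soon as it reaches `batchSize`, and flushes whatever is left when the bundle finishes, which is why no buffering timeout is needed.

```scala
import scala.collection.mutable

// Hypothetical model of bundle-local batching; NOT the PR's BatchDoFn.
// W is the window type, T the element type.
final class BundleBatcher[W, T](batchSize: Int, emit: (W, Vector[T]) => Unit) {
  require(batchSize > 0, "batchSize must be positive")

  // One buffer per window, so a batch never mixes elements from different windows.
  private val buffers = mutable.LinkedHashMap.empty[W, Vector[T]]

  // Called once per element: buffer it, and emit the batch when it is full.
  def processElement(window: W, element: T): Unit = {
    val updated = buffers.getOrElse(window, Vector.empty[T]) :+ element
    if (updated.size >= batchSize) {
      buffers.remove(window)
      emit(window, updated)
    } else {
      buffers.update(window, updated)
    }
  }

  // Called at the end of the bundle: flush every partially filled batch.
  def finishBundle(): Unit = {
    buffers.foreach { case (window, batch) => emit(window, batch) }
    buffers.clear()
  }
}
```

Since nothing stays buffered after `finishBundle`, a batch can be smaller than `batchSize` whenever the bundle ends early or a window receives few elements.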

RustedBones avatar Aug 08 '22 16:08 RustedBones

Codecov Report

Merging #4489 (e207766) into main (36935c1) will decrease coverage by 0.29%. The diff coverage is 54.05%.

:exclamation: Current head e207766 differs from pull request most recent head 3e7625b. Consider uploading reports for the commit 3e7625b to get more accurate results

@@            Coverage Diff             @@
##             main    #4489      +/-   ##
==========================================
- Coverage   60.48%   60.19%   -0.30%     
==========================================
  Files         275      275              
  Lines        9882    10061     +179     
  Branches      438      840     +402     
==========================================
+ Hits         5977     6056      +79     
- Misses       3905     4005     +100     
Impacted Files Coverage Δ
...in/scala/com/spotify/scio/values/SCollection.scala 88.38% <0.00%> (-5.27%) :arrow_down:
...scala/com/spotify/scio/bigquery/MockBigQuery.scala 0.00% <0.00%> (ø)
...la/com/spotify/scio/bigquery/client/TableOps.scala 0.00% <0.00%> (ø)
...a/com/spotify/scio/testing/TransformOverride.scala 100.00% <100.00%> (ø)
...om/spotify/scio/elasticsearch/CoderInstances.scala 44.11% <0.00%> (-5.89%) :arrow_down:
...om/spotify/scio/elasticsearch/CoderInstances.scala 42.42% <0.00%> (-5.86%) :arrow_down:
...com/spotify/scio/bigquery/types/TypeProvider.scala 47.22% <0.00%> (-2.78%) :arrow_down:
...la/com/spotify/scio/bigquery/client/BigQuery.scala 22.44% <0.00%> (-2.56%) :arrow_down:
...rc/main/scala/com/spotify/scio/util/ScioUtil.scala 59.25% <0.00%> (-2.28%) :arrow_down:
...n/scala/com/spotify/scio/extra/annoy/package.scala 80.00% <0.00%> (-2.06%) :arrow_down:
... and 42 more

codecov[bot] avatar Aug 08 '22 16:08 codecov[bot]

I ran some tests on Dataflow with gs://apache-beam-samples/shakespeare/kinglear.txt as input:

  • fixed-size batches are respected
  • weighted batches are respected
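
To make the difference between the two batching modes concrete, here is a small plain-Scala sketch. The helper `batchWeighted` and its closing policy are assumptions made for illustration, not the scio API or the logic in this PR: a batch is closed just before adding an element would push its total weight above `maxWeight`, and a single element heavier than the limit still forms its own batch. Fixed-size batching falls out as the special case where every element has weight 1.

```scala
// Hypothetical helper, for illustration only; not part of scio.
// Splits `elements` into consecutive batches whose total cost stays within
// `maxWeight` (assumed policy: close the batch before it would exceed the limit).
def batchWeighted[T](elements: Seq[T], maxWeight: Long)(cost: T => Long): Vector[Vector[T]] = {
  val start = (Vector.empty[Vector[T]], Vector.empty[T], 0L)
  val (closed, open, _) = elements.foldLeft(start) {
    case ((done, current, weight), element) =>
      val w = cost(element)
      if (current.nonEmpty && weight + w > maxWeight)
        (done :+ current, Vector(element), w) // close the batch, start a new one
      else
        (done, current :+ element, weight + w) // element still fits
  }
  // Emit the trailing partial batch, if any.
  if (open.isEmpty) closed else closed :+ open
}

// Weighted: cost is the string length, limit 4.
val weighted = batchWeighted(Seq("a", "bb", "ccc", "dddd", "e"), 4L)(_.length.toLong)

// Fixed size: every element costs 1, limit 2.
val fixed = batchWeighted(1 to 5, 2L)(_ => 1L)
```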

RustedBones avatar Aug 19 '22 14:08 RustedBones

@clairemcginty all comments should be addressed. I managed to trick the tests into producing a single bundle, which verifies that batching works within a bundle.

RustedBones avatar Aug 23 '22 12:08 RustedBones