Pablo
TODO @pabloem - what is the produced watermark and why do stages differentiate
Once we start parallelizing the execution of bundles, our watermarking will need to change (we need to wait for all parallel tasks of a given stage before advancing its watermark...
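A rough sketch of the "wait for all parallel tasks before advancing" idea, in plain Python with no Ray or Beam dependency. `StageWatermark`, its method names, and the float timestamps are hypothetical and only for illustration; the runner's actual watermark bookkeeping may look quite different.

```python
import math
from typing import Set


class StageWatermark:
    """Hypothetical helper: holds a stage's watermark while bundle tasks are in flight."""

    def __init__(self) -> None:
        self._pending: Set[str] = set()   # ids of bundle tasks still running
        self._watermark = -math.inf       # nothing processed yet

    def start_task(self, task_id: str) -> None:
        # Register a parallel bundle task for this stage.
        self._pending.add(task_id)

    def finish_task(self, task_id: str) -> None:
        # Mark a bundle task as done; discard() tolerates unknown ids.
        self._pending.discard(task_id)

    def try_advance(self, candidate: float) -> float:
        # Only advance once every parallel task of the stage has finished,
        # and never move the watermark backwards.
        if not self._pending:
            self._watermark = max(self._watermark, candidate)
        return self._watermark
```

With two bundles in flight, `try_advance` keeps returning the old watermark until both have called `finish_task`; only then does the candidate value take effect.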
That would be great! We can track performance using this action: https://github.com/benchmark-action/github-action-benchmark I don't think it needs to be very big. I think if it processes 1 GB running locally with...
We have a few microbenchmarks in Beam that you could use as inspiration, but I don't think they're big enough to test our runner and optimize over time:
- https://github.com/apache/beam/blob/master/sdks/python/apache_beam/tools/fn_api_runner_microbenchmark.py
...
Ray Java resources:
- https://docs.ray.io/en/latest/ray-core/configure.html#java-applications
- https://docs.ray.io/en/latest/ray-core/cross-language.html#cross-language
- https://docs.ray.io/en/latest/ray-core/package-ref.html

fyi @iasoon @valiantljk this issue is more complex than the other stuff you've tried, but it should help move one of...
Yes, we would have to add support for expanding Java PTransforms. I think we can limit the scope of this quite a bit while still delivering SQL execution.
@wilsonwang371 this is the task where we parallelize the processing of data : )
@valiantljk is this something that you could consider taking a stab at? : ) You'd have to add some smart code so that ongoing `ray_execute_bundle` tasks can report progress -...
lmk if that helps
Useful literature: https://s.apache.org/beam-fn-api