
Add watermark-based scheduling to the Ray Runner

Open pabloem opened this issue 3 years ago • 3 comments

The Ray Runner currently works by topologically sorting the pipeline graph, and executing stage by stage until the whole pipeline has been executed. This means that it only supports batch mode, and it can't execute multiple stages in parallel.

By implementing watermark-based scheduling, and by executing any bundle that is ready for execution, we can start gaining parallelism, and move towards streaming support.

This work is somewhat involved because it requires changing the whole execution logic for the pipeline. However, it should increase our parallelism, which will be great (see https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py#L420-L487).
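
To make the idea concrete, here is a minimal sketch of the scheduling change in plain Python. It is not the Ray Runner's code and uses no Beam or Ray APIs: `run_pipeline`, `input_watermark`, the `upstream` mapping and the stage names are all hypothetical, and it assumes a batch-only simplification in which a stage's output watermark jumps from the minimum to the maximum once the stage finishes. A stage becomes ready when its input watermark (the minimum output watermark of its producers) reaches the maximum, and all ready stages are submitted together instead of one at a time in topological order.

```python
import math
from concurrent.futures import ThreadPoolExecutor

MIN_WATERMARK = -math.inf
MAX_WATERMARK = math.inf


def input_watermark(stage, upstream, output_watermarks):
    """Input watermark of a stage: the minimum output watermark of its producers."""
    producers = upstream[stage]
    if not producers:
        # Simplification: root stages have no upstream data to wait for.
        return MAX_WATERMARK
    return min(output_watermarks[p] for p in producers)


def run_pipeline(upstream, run_stage, max_workers=4):
    """Run every stage whose input watermark has reached MAX_WATERMARK.

    `upstream` maps each stage name to the list of stages it consumes from;
    every stage must appear as a key. Returns the groups of stages that were
    executed together, in order.
    """
    output_watermarks = {stage: MIN_WATERMARK for stage in upstream}
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while any(w < MAX_WATERMARK for w in output_watermarks.values()):
            ready = sorted(
                stage
                for stage in upstream
                if output_watermarks[stage] < MAX_WATERMARK
                and input_watermark(stage, upstream, output_watermarks) == MAX_WATERMARK
            )
            if not ready:
                raise ValueError("no stage is ready; the graph has a cycle")
            # All ready stages are independent of each other, so run them in parallel.
            list(pool.map(run_stage, ready))
            # Batch assumption: a finished stage has emitted all of its output.
            for stage in ready:
                output_watermarks[stage] = MAX_WATERMARK
            waves.append(ready)
    return waves


# Hypothetical pipeline: "count" and "sample" both consume "parse".
upstream = {
    "read": [],
    "parse": ["read"],
    "count": ["parse"],
    "sample": ["parse"],
    "write": ["count", "sample"],
}
executed = []
waves = run_pipeline(upstream, executed.append)
print(waves)
```

In this sketch `count` and `sample` land in the same group and run concurrently, which a strict stage-by-stage loop would not allow. It still waits for a whole group to finish before looking for more work; a real watermark-based scheduler would advance watermarks per bundle and per PCollection, which is what streaming support would need.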

pabloem avatar Jun 08 '22 03:06 pabloem

Useful literature: https://s.apache.org/beam-fn-api

pabloem avatar Jun 14 '22 17:06 pabloem

What is our plan for adding watermark-based scheduling? Is there a reference implementation that we need to discuss, for example Flink?

wilsonwang371 avatar Jun 16 '22 18:06 wilsonwang371

Hi @pabloem @iasoon, is this PR https://github.com/ray-project/ray_beam_runner/pull/24 associated with this issue? I might be interested in picking this up.

rkenmi avatar Jan 18 '23 05:01 rkenmi