ray_beam_runner icon indicating copy to clipboard operation
ray_beam_runner copied to clipboard

Parallelization: 'Reshuffle' data shared between stages

Open pabloem opened this issue 3 years ago • 2 comments

When a stage is executed (in ray_execute_bundle), its output can be immediately reshuffled so that its downstream processing can be parallelized.

When the upstream stage performs a write to GroupByKey, then we must group before reshuffling data (data belonging to the same key must be processed in the same worker).

If the upstream stage is not performing a GBK, then we can simply reshard everything without worrying about individual keys.

pabloem avatar Jun 14 '22 22:06 pabloem

@wilsonwang371 this is the task where we parallelize the processing of data : )

pabloem avatar Jun 14 '22 22:06 pabloem

Sounds good. I generally understand this. But regarding details, I will catch up with you guys by picking up and working on small tasks.

wilsonwang371 avatar Jun 16 '22 20:06 wilsonwang371