ray_beam_runner
ray_beam_runner copied to clipboard
Parallelization: 'Reshuffle' data shared between stages
When a stage is executed (in ray_execute_bundle), its output can be immediately reshuffled so that its downstream processing can be parallelized.
When the upstream stage performs a write to GroupByKey, then we must group before reshuffling data (data belonging to the same key must be processed in the same worker).
If the upstream stage is not performing a GBK, then we can simply reshard everything without worrying about individual keys.
@wilsonwang371 this is the task where we parallelize the processing of data : )
Sounds good. I generally understand this. But regarding details, I will catch up with you guys by picking up and working on small tasks.