Supporting Broadcast data transfer strategy for Python UDF model

Open zuozhiw opened this issue 3 years ago • 0 comments

The 2-input Python UDF operator takes two inputs: "model" and "data". The "model" input is a single tuple, generated from a source file by a single worker.

Without a broadcast data transfer strategy, the Python UDF cannot be parallelized: under the other strategies (hash partitioning, round-robin, etc.), the model tuple is sent to only one worker instead of all of them.
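To illustrate the problem, here is a minimal sketch in plain Python. The function names (`round_robin`, `hash_partition`, `broadcast`) and the dict-based tuple are hypothetical stand-ins, not the engine's actual partitioner API. It only shows how a single "model" tuple ends up on the downstream workers under each strategy.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]  # hypothetical stand-in for an engine tuple


def round_robin(tuples: List[Row], num_workers: int) -> List[List[Row]]:
    """Send the i-th tuple to worker i mod num_workers."""
    out: List[List[Row]] = [[] for _ in range(num_workers)]
    for i, t in enumerate(tuples):
        out[i % num_workers].append(t)
    return out


def hash_partition(tuples: List[Row], num_workers: int, key: str) -> List[List[Row]]:
    """Send each tuple to the worker selected by the hash of its key field."""
    out: List[List[Row]] = [[] for _ in range(num_workers)]
    for t in tuples:
        out[hash(t[key]) % num_workers].append(t)
    return out


def broadcast(tuples: List[Row], num_workers: int) -> List[List[Row]]:
    """Send a copy of every tuple to every worker."""
    return [list(tuples) for _ in range(num_workers)]


# The "model" input is a single tuple (contents are made up for illustration).
model = [{"id": 7, "weights": [0.1, 0.2]}]

# With round-robin or hash partitioning, only one of the 3 workers gets the model.
workers_with_model_rr = sum(1 for w in round_robin(model, 3) if w)
workers_with_model_hash = sum(1 for w in hash_partition(model, 3, "id") if w)

# With broadcast, all 3 workers get it.
workers_with_model_bc = sum(1 for w in broadcast(model, 3) if w)
```

With a one-tuple input, the first two strategies leave every worker but one without a model, so only that worker could run the UDF.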

To implement the broadcast data transfer strategy, we can let the Python UDF operator specify the partitioning as a requirement on each input port.
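One possible shape for such a per-port requirement is sketched below. Everything here is an assumption for illustration: `Partitioning`, `InputPort`, `python_udf_input_ports`, and `choose_strategy` are hypothetical names, not existing classes in the codebase, and the round-robin default is an arbitrary choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Partitioning(Enum):
    """Hypothetical set of data transfer strategies."""
    ANY = "any"                  # no requirement; the scheduler picks a default
    ROUND_ROBIN = "round_robin"
    HASH = "hash"
    BROADCAST = "broadcast"


@dataclass(frozen=True)
class InputPort:
    """Hypothetical input port that carries its partitioning requirement."""
    name: str
    required: Partitioning = Partitioning.ANY


def python_udf_input_ports() -> List[InputPort]:
    """The 2-input Python UDF: 'model' must be broadcast, 'data' has no requirement."""
    return [
        InputPort("model", Partitioning.BROADCAST),
        InputPort("data"),
    ]


def choose_strategy(
    port: InputPort, default: Partitioning = Partitioning.ROUND_ROBIN
) -> Partitioning:
    """Honor the port's requirement if it has one, else fall back to the default."""
    return default if port.required is Partitioning.ANY else port.required


ports = python_udf_input_ports()
strategies = {p.name: choose_strategy(p) for p in ports}
```

Under this sketch, the link feeding the "model" port would be planned as broadcast, while the "data" port keeps whatever default the scheduler uses, so the data can still be split across workers.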

Jul 29 '22 03:07 zuozhiw