data-prep-kit Add a pipeline transform that takes a sequential list of transforms and executes them, passing data between them in memory.

Search before asking

[x] I searched the issues and found no similar issues.

Component

transforms/Other

Feature

In the current implementation of DPK, each transform reads all of its input data and writes all of its results back out. As a result, running several transforms sequentially (through either KFP or a notebook) incurs a lot of IO. The proposed feature introduces a "pipeline transform" that defines a sequence of ordinary transforms and executes them in order, passing data between them in memory, which significantly reduces the amount of IO. When intermediate results are not required, using such a transform can significantly speed up overall execution.

An example of such implementation can be found here: https://github.com/The-AI-Alliance/dpk/commit/cd9fae008f5830a4ecaf2d23c665b8f50bb6d8a2
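
To make the idea concrete, here is a minimal sketch of the chaining behavior. It is not taken from the linked commit and does not use DPK's actual API: `Table`, `Transform`, `run_pipeline`, `drop_empty`, and `add_length` are all hypothetical names, and a table is assumed to be a plain list of dicts.

```python
from typing import Callable, Iterable, List

# Hypothetical types for illustration only; these are not DPK classes.
Table = List[dict]
Transform = Callable[[Table], Table]


def run_pipeline(table: Table, transforms: Iterable[Transform]) -> Table:
    """Apply each transform in order, handing its result to the next one in memory."""
    for transform in transforms:
        table = transform(table)
        if not table:  # nothing left to process, so skip the remaining steps
            break
    return table


def drop_empty(table: Table) -> Table:
    # Example step: remove rows whose text is blank.
    return [row for row in table if row["text"].strip()]


def add_length(table: Table) -> Table:
    # Example step: annotate each row with the length of its text.
    return [{**row, "length": len(row["text"])} for row in table]


rows = [{"text": "hello"}, {"text": "   "}, {"text": "data prep"}]
result = run_pipeline(rows, [drop_empty, add_length])
```

In this sketch only the final `result` would need to be written out; the output of `drop_empty` never touches storage, which is the IO saving described above.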

Are you willing to submit a PR?

[x] Yes I am willing to submit a PR!

Mar 21 '25 15:03 blublinsky

Longer-term, an even better optimization to support would be "operator fusion", a behind-the-scenes optimization where transforms can sometimes be combined together to perform several "steps" at the same time for an input record. (Spark does this, for example.)
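
As a rough illustration of what fusion means for simple record-at-a-time steps, the sketch below composes several steps into one function so each record is visited once. This is a toy under stated assumptions, not how Spark or DPK implements anything: `fuse`, `lowercase`, and `count_chars` are hypothetical names, and records are assumed to be plain dicts.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical types for illustration only.
Record = Dict[str, object]
Step = Callable[[Record], Record]


def fuse(*steps: Step) -> Step:
    """Combine record-level steps into one function applied once per record."""
    def fused(record: Record) -> Record:
        return reduce(lambda acc, step: step(acc), steps, record)
    return fused


def lowercase(record: Record) -> Record:
    return {**record, "text": str(record["text"]).lower()}


def count_chars(record: Record) -> Record:
    return {**record, "length": len(str(record["text"]))}


def apply_step(records: Iterable[Record], step: Step) -> Iterator[Record]:
    # A single pass over the input: each record goes through every fused step
    # before the next record is read, so no intermediate table is built.
    for record in records:
        yield step(record)


fused_step = fuse(lowercase, count_chars)
fused_result = list(apply_step([{"text": "Hello"}, {"text": "DPK"}], fused_step))
```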

Mar 21 '25 15:03 deanwampler

cc: @shahrokhDaijavad

Mar 24 '25 23:03 shahrokhDaijavad

"Operator fusion" is nice, but I think the ability to combine steps manually is a necessary first step.

Mar 25 '25 10:03 blublinsky

pending PR in #1329 and already raised in #1102

Jun 24 '25 20:06 swith005