
Add a pipeline transform that takes a sequential list of transforms and executes them, passing data between them in memory.

Open blublinsky opened this issue 7 months ago • 3 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/Other

Feature

In the current implementation of DPK, each transform reads all of its input data and writes all of its results out. As a result, running several transforms sequentially (either through KFP or a notebook) incurs a lot of IO. The proposed feature introduces a "pipeline transform" that defines a sequence of ordinary transforms and executes them in order, passing data between them in memory, which significantly reduces the amount of IO. When intermediate results are not required, using such a transform can significantly speed up overall execution.

An example of such an implementation can be found here: https://github.com/The-AI-Alliance/dpk/commit/cd9fae008f5830a4ecaf2d23c665b8f50bb6d8a2
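
Independently of the linked commit, the core idea can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration and is not the DPK API: rows are plain dictionaries, a transform is any callable that returns `(rows, metadata)`, and the names `run_pipeline`, `drop_empty`, and `add_length` are hypothetical. Metadata values are assumed to be numeric counters so that they can be summed across steps.

```python
from typing import Any, Callable

# Hypothetical data model: a batch is a list of row dictionaries.
Rows = list[dict[str, Any]]
# Hypothetical transform signature: rows in, (rows out, numeric metadata) out.
Transform = Callable[[Rows], tuple[Rows, dict[str, int]]]


def drop_empty(rows: Rows) -> tuple[Rows, dict[str, int]]:
    """Example transform: remove rows whose 'text' is blank."""
    kept = [r for r in rows if r["text"].strip()]
    return kept, {"dropped": len(rows) - len(kept)}


def add_length(rows: Rows) -> tuple[Rows, dict[str, int]]:
    """Example transform: add a 'length' column derived from 'text'."""
    return [{**r, "length": len(r["text"])} for r in rows], {}


def run_pipeline(rows: Rows, transforms: list[Transform]) -> tuple[Rows, dict[str, int]]:
    """Run transforms in order, handing each one the previous output in memory."""
    stats: dict[str, int] = {}
    for transform in transforms:
        rows, metadata = transform(rows)
        # Merge per-step counters into one summary (assumes numeric values).
        for key, value in metadata.items():
            stats[key] = stats.get(key, 0) + value
        if not rows:
            break  # nothing left for later steps to process
    return rows, stats


data = [{"text": "hello"}, {"text": "  "}, {"text": "hi"}]
result, summary = run_pipeline(data, [drop_empty, add_length])
print(result, summary)
```

In this sketch only the input to the first step and the output of the last step would need to touch storage; the intermediate result of `drop_empty` is never written out, which is where the IO saving would come from.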

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

blublinsky, Mar 21 '25 15:03