data-prep-kit
Add a pipeline transform that takes a sequential list of transforms and executes them, passing data between them in memory.
Search before asking
- [x] I searched the issues and found no similar issues.
Component
transforms/Other
Feature
In the current implementation of DPK, each transform reads all of its input data and writes all of its results out. As a result, running several transforms sequentially (through either KFP or a notebook) incurs a lot of IO. The proposed feature introduces a "pipeline transform": a transform that defines a sequence of ordinary transforms and executes them in order, passing data between them in memory, which significantly reduces the amount of IO. When intermediate results are not required, using such a transform can significantly speed up overall execution.
An example of such an implementation can be found here: https://github.com/The-AI-Alliance/dpk/commit/cd9fae008f5830a4ecaf2d23c665b8f50bb6d8a2
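To make the idea concrete, here is a minimal sketch of the chaining logic. It is not the DPK API and not the code in the linked commit: `PipelineTransform`, `drop_empty`, `lowercase`, and the "rows in, (rows, stats) out" call shape are all hypothetical, and plain Python lists stand in for real tables. It only illustrates how each step's output could be handed to the next step in memory, with one read before the first step and one write after the last.

```python
from typing import Any, Callable

# Hypothetical data shape: a batch is a list of row dictionaries.
Rows = list[dict[str, Any]]
# Hypothetical transform shape: takes rows, returns (new rows, numeric stats).
Transform = Callable[[Rows], tuple[Rows, dict[str, int]]]


class PipelineTransform:
    """Runs a fixed sequence of transforms, passing data in memory."""

    def __init__(self, transforms: list[Transform]) -> None:
        self.transforms = transforms

    def transform(self, rows: Rows) -> tuple[Rows, dict[str, int]]:
        stats: dict[str, int] = {}
        for step in self.transforms:
            # The output of one step becomes the input of the next;
            # nothing is written to storage between steps.
            rows, step_stats = step(rows)
            # Merge per-step statistics by summing values with the same key.
            for key, value in step_stats.items():
                stats[key] = stats.get(key, 0) + value
        return rows, stats


def drop_empty(rows: Rows) -> tuple[Rows, dict[str, int]]:
    """Illustrative step: remove rows whose text is blank."""
    kept = [row for row in rows if row["text"].strip()]
    return kept, {"dropped_empty": len(rows) - len(kept)}


def lowercase(rows: Rows) -> tuple[Rows, dict[str, int]]:
    """Illustrative step: lowercase the text of every row."""
    return [{**row, "text": row["text"].lower()} for row in rows], {
        "lowercased": len(rows)
    }


pipeline = PipelineTransform([drop_empty, lowercase])
result, stats = pipeline.transform(
    [{"text": "Hello"}, {"text": "  "}, {"text": "World"}]
)
print(result)
print(stats)
```

In this sketch the intermediate output of `drop_empty` exists only in memory, which is the IO saving described above. The trade-off is the same one the proposal notes: intermediate results are not persisted, so this fits only the cases where they are not needed.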
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!