
Request: Ability to pass large pandas dataframes between pipeline components (without creating artifacts)

Open • joeswashington opened this issue 2 years ago • 4 comments

We would like the ability to pass the result of a pandas dataframe operation from one pipeline component to another without having to create an input/output artifact.

As it stands, we would have to write a CSV file in one component and read it back in the other component, which is slow.
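
For concreteness, here is roughly what that workaround looks like with KFP v1 lightweight Python components (the function and pipeline names are illustrative):

```python
# Roughly what the CSV workaround looks like today (illustrative names).
import kfp
from kfp.components import InputPath, OutputPath, create_component_from_func

def make_dataframe(output_csv_path: OutputPath('CSV')):
    # Step 1: build the dataframe, then serialize it to a CSV artifact.
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    df.to_csv(output_csv_path, index=False)

def consume_dataframe(input_csv_path: InputPath('CSV')):
    # Step 2: deserialize the CSV artifact back into a dataframe.
    import pandas as pd
    df = pd.read_csv(input_csv_path)
    print(df.describe())

make_dataframe_op = create_component_from_func(
    make_dataframe, base_image='python:3.9', packages_to_install=['pandas'])
consume_dataframe_op = create_component_from_func(
    consume_dataframe, base_image='python:3.9', packages_to_install=['pandas'])

@kfp.dsl.pipeline(name='df-passing')
def df_passing_pipeline():
    produced = make_dataframe_op()
    consume_dataframe_op(produced.outputs['output_csv'])
```

The serialize/deserialize round trip (and the artifact upload/download behind it) is the overhead we would like to avoid.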

joeswashington avatar Sep 02 '21 19:09 joeswashington

What do you mean by passing the results of a pandas dataframe? If it's all internal to Python, I think you should include those operations in the same component. Tekton supports passing data with results or workspaces, and KFP supports artifacts; these are the standard ways to share data between components, so maybe consider how to split your logic accordingly.
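
For example, a minimal sketch of keeping both steps inside one component (the operations here are placeholders):

```python
# Sketch of the single-component alternative: both steps run in one
# process, so the dataframe stays in memory and nothing is serialized.
from kfp.components import create_component_from_func

def process_dataframe():
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})  # formerly step 1
    df['c'] = df['a'] + df['b']                          # formerly step 2
    print(df.describe())

process_dataframe_op = create_component_from_func(
    process_dataframe, base_image='python:3.9', packages_to_install=['pandas'])
```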

pugangxa avatar Sep 09 '21 14:09 pugangxa

For @joeswashington's use case, we would probably need to invent a new custom task controller that does something similar to Spark, where the output of a pipeline task can be stored in the Spark driver's memory. This kind of use case is usually addressed in the Spark community rather than in Tekton, so I would recommend running all the dataframe processing on a Spark cluster and using a KFP-Tekton component as the Spark client.
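
A rough sketch of that client pattern, assuming the component image ships `spark-submit` and the cluster is reachable from the pod (the master URL, image, and job file are illustrative assumptions, not an existing component):

```python
# Hypothetical Spark-client component: the pipeline step only submits
# the job; all dataframe processing stays inside the Spark cluster.
from kfp.components import create_component_from_func

def submit_spark_job(master_url: str, job_py_file: str):
    import subprocess
    # 'spark-submit' must be present in the base image (assumption).
    subprocess.run(
        ['spark-submit', '--master', master_url, job_py_file],
        check=True,
    )

submit_spark_job_op = create_component_from_func(
    submit_spark_job,
    base_image='bitnami/spark:3',  # illustrative image with spark-submit
)
```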

Tomcli avatar Sep 09 '21 16:09 Tomcli

@joeswashington Are you sure your request is feasible?

The producer and consumer tasks probably run on different machines, so the producer needs to send the data over the network and the consumer container needs to receive it from the network. Also, the producer and consumer run at different times (the consumer task only starts after the producer task finishes), so the data needs to be stored somewhere. The intermediate data storage is also important for cache reuse: you don't want to redo the same data processing or training multiple times.

So, it looks like it's inevitable that the produced data is uploaded somewhere and downloaded when it needs to be consumed. You cannot really have a distributed system without passing data over the network.

P.S. KFP has a way to seamlessly switch all data-passing to a Kubernetes volume, but we do not really see people using that feature. Kubernetes volumes are also accessed over the network...
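
For reference, a minimal sketch of that volume-based data passing in the KFP v1 SDK (the claim name and path prefix are illustrative, and the PVC is assumed to already exist):

```python
# Redirect all artifact hand-offs in a pipeline to a Kubernetes volume
# instead of the default artifact storage.
import kfp
from kfp.dsl import PipelineConf, data_passing_methods
from kubernetes.client.models import (
    V1Volume,
    V1PersistentVolumeClaimVolumeSource,
)

pipeline_conf = PipelineConf()
pipeline_conf.data_passing_method = data_passing_methods.KubernetesVolume(
    volume=V1Volume(
        name='data',
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
            claim_name='data-volume',  # assumed pre-existing PVC
        ),
    ),
    path_prefix='artifact_data/',
)

# Pass the config at compile time:
# kfp.compiler.Compiler().compile(my_pipeline, 'pipeline.yaml',
#                                 pipeline_conf=pipeline_conf)
```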

Ark-kun avatar Nov 07 '21 06:11 Ark-kun

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 02 '22 08:03 stale[bot]