
Recommended way to reuse TFX pipeline artifacts within the same pipeline or across different pipelines

Open calvinleungyk opened this issue 4 years ago • 3 comments

We would like to be able to reuse already-produced pipeline artifacts within the same pipeline or across pipelines, via something like ResolverNode.

After running some components, we store the artifacts on GCS. Currently, if we want to use those artifacts, we need to create a mock component with @component, inject the URIs of the artifacts, and execute it so that the component is registered in MLMD. Otherwise the downstream components are not aware of the artifacts and cannot reuse them.

It would be great if there were a simpler way to achieve this, so that we can share artifacts across pipelines more easily, or recover from a notebook crash without rerunning pipeline components or hacking around with mock components.

calvinleungyk avatar Oct 07 '20 19:10 calvinleungyk

I see that ResolverNode was already considered in the offline discussion. I think ImporterNode is probably a better fit for your use case. There are some examples in our GCP integration tests that show how we use it to import previously produced results and create two-node pipelines ([Importer, to-test node]) for quick testing.

This is still quite low level, since it asks for a source_uri. From the description, it seems you'd like this to be automated through some higher-level MLMD query that resolves to a previously produced artifact?

zhitaoli avatar Oct 07 '20 20:10 zhitaoli

@calvinleungyk, can you please respond to the above comment? Thanks!

rmothukuru avatar Nov 02 '20 08:11 rmothukuru

Hi @rmothukuru, thanks for the reminder. It looks like ImporterNode has a better user experience than what we're currently doing with mock components, so we will try it out soon. We have used the pipeline run context to fetch artifacts for other purposes, but that required the run_id in the PipelineInfo. We would like a way to do this with interactive pipeline runs too.

As @zhitaoli said, it would be great to have a higher-level wrapper for fetching previously produced artifacts. Ideally, we could use a pipeline root or run_id in something like:

schema_gen = fetch_existing_artifact(name='statistics') or SchemaGen(statistics=statistics_gen.output['statistics'])

or even better:

schema_gen = SchemaGen(statistics=statistics_gen.output['statistics'], fetch_existing_artifact=True)
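
To make the pipeline-root idea concrete, here is a rough sketch of a lookup helper using only the Python standard library. The function name find_latest_artifact_uri is hypothetical, and it assumes artifacts are laid out as <pipeline_root>/<component_id>/<output_key>/<execution_id> with integer execution ids on a local filesystem. It does not query MLMD, and GCS paths would need a different file API.

```python
import os
from typing import Optional


def find_latest_artifact_uri(pipeline_root: str,
                             component_id: str,
                             output_key: str) -> Optional[str]:
    """Return the newest artifact directory for one output, or None.

    Assumed layout (not checked against MLMD):
    <pipeline_root>/<component_id>/<output_key>/<execution_id>
    """
    output_dir = os.path.join(pipeline_root, component_id, output_key)
    if not os.path.isdir(output_dir):
        return None
    # Keep only sub-directories whose names are integer execution ids.
    execution_dirs = [
        name for name in os.listdir(output_dir)
        if name.isdigit() and os.path.isdir(os.path.join(output_dir, name))
    ]
    if not execution_dirs:
        return None
    # Compare numerically so that '10' is treated as newer than '2'.
    latest = max(execution_dirs, key=int)
    return os.path.join(output_dir, latest)
```

If the helper returns a path, it could be supplied as the source_uri of an ImporterNode; if it returns None, the pipeline would fall back to running the producing component as usual.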

calvinleungyk avatar Nov 03 '20 17:11 calvinleungyk

@calvinleungyk,

Are you still looking for a resolution? We are planning to prioritise issues based on community interest. Please let us know if this issue still persists with the latest TFX 1.13 release so that we can work on fixing it. Thank you for your contributions.

singhniraj08 avatar May 05 '23 10:05 singhniraj08

This issue has been marked stale because it has had no activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar May 13 '23 01:05 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for the past 7 days.

github-actions[bot] avatar May 20 '23 01:05 github-actions[bot]