tfx Cache skipped with same execution properties

System information

Have I specified the code to reproduce the issue (Yes, No): No
Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Kubeflow Pipelines GKE deployment - KF 1.5 version
TensorFlow version: 2.5.1
TFX Version: 1.2.1
Python version: 3.7
Python dependencies (from pip freeze output):

Describe the current behavior

When running an ExampleGen with identical execution properties on DataflowRunner, the cache gets skipped.

Describe the expected behavior

Cache gets hit.

Standalone code to reproduce the issue

No code, but we do have execution -> https://gist.github.com/casassg/cdc5e7ef216ceac90f49adc0b7721c11

Name of your Organization (Optional) Twitter

Other info / logs

It is important to note that we do see a difference in beam_pipeline_args, but we are not sure whether that is used for cache validation.

Sep 23 '22 23:09 casassg

Does the query string change?

Note that beam_pipeline_args differs because we built a container image and pushed it with 2 tags. However, I've been digging through the TFX code that checks for previous executions and have not been able to find a place where it would break the cache.

Could you try rerunning with exactly the same beam_pipeline_args to see if it caches?

Sep 24 '22 00:09 rcrowe-google

The query string stays the same. I had to scrub it from the gist for privacy reasons, but I can assure you it's the same. Also, when running with the same beam_pipeline_args, it caches (I cloned the run in the UI to validate this).

I'm 90% sure this is due to beam_pipeline_args changing, but this also seems quite non-intuitive. Why does the cache get invalidated by a class property like this? Ideally that should only modify how the component gets executed; if the inputs and exec_properties are the same, it should hit the cache if enabled. That said, I have not been able to figure out which logic this is hitting.

Sep 27 '22 23:09 casassg

Having the same issue. context.run(example_gen, enable_cache=True) in a Jupyter notebook is not using the cache, whereas the following execution does use the cache (executed as Python code from IntelliJ).

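from tfx import v1 as tfx

# METADATA_PATH and PIPELINE_ROOT are pathlib.Path objects; PIPELINE_NAME and
# components are defined elsewhere in the script.
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(METADATA_PATH.as_posix())
pipeline = tfx.dsl.Pipeline(
    pipeline_name=PIPELINE_NAME,
    pipeline_root=PIPELINE_ROOT.as_posix(),
    components=components,
    enable_cache=True,
    metadata_connection_config=metadata_connection_config,
)
# Run the pipeline locally with caching enabled.
result = tfx.orchestration.LocalDagRunner().run(pipeline)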

Is there any way to force StatisticsGen to use a given version of the ExampleGen output, i.e. to reuse a previous output?

Jun 20 '23 10:06 ismailsimsek

@casassg, @ismailsimsek ,

The cache key is generated by applying the SHA-256 hash function to:

Serialized pipeline info.
Serialized node_info of the PipelineNode.
Serialized executor spec.
Serialized input artifacts, if any.
Serialized output artifacts, if any. The URI is removed before hashing.
Serialized parameters, if any.
Serialized module file content, if a module file is present in the parameters.

Changing any of the above invalidates the cache. Please make sure all of them stay constant between runs. If the cache is still skipped after that, please let us know. Thank you!
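The list above can be sketched in a few lines of Python. This is only an illustration of the idea, not TFX's actual implementation: the function names, the dictionary layout, the JSON serialization, and the image name are all assumptions made for the example. It shows that a change inside the executor spec changes the key, while a change in an output URI does not.

```python
import hashlib
import json


def cache_key(pipeline_info, node_info, executor_spec, inputs, outputs,
              parameters, module_file_content=None):
    """Hypothetical cache key: SHA-256 over the serialized parts listed above."""
    h = hashlib.sha256()
    # Drop output URIs before hashing, mirroring the note in the list.
    outputs_without_uri = [
        {k: v for k, v in o.items() if k != "uri"} for o in outputs
    ]
    for part in (pipeline_info, node_info, executor_spec, inputs,
                 outputs_without_uri, parameters):
        h.update(json.dumps(part, sort_keys=True).encode("utf-8"))
    if module_file_content is not None:
        h.update(module_file_content.encode("utf-8"))
    return h.hexdigest()


def example_key(image_tag, output_uri="gs://example-bucket/out/1"):
    """Build a key for an illustrative ExampleGen-like node (made-up values)."""
    executor_spec = {
        "class_path": "example.Executor",
        "beam_pipeline_args": [
            "--runner=DataflowRunner",
            f"--sdk_container_image=gcr.io/example/image:{image_tag}",
        ],
    }
    return cache_key(
        pipeline_info={"id": "my_pipeline"},
        node_info={"id": "ExampleGen"},
        executor_spec=executor_spec,
        inputs=[],
        outputs=[{"type": "Examples", "uri": output_uri}],
        parameters={"query": "SELECT 1"},
    )


if __name__ == "__main__":
    # Different image tags give different keys; different output URIs do not.
    print(example_key("v1") == example_key("v2"))
    print(example_key("v1") == example_key("v1", output_uri="gs://other/out/2"))
```

In this sketch, two runs that differ only in the container image tag produce different keys, which matches the behavior reported in this thread.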

Jun 28 '23 05:06 singhniraj08

Note that the issue is beam pipeline args being part of the cache key (as those are execution configuration for Beam). Also, I'm no longer working with TFX, so unfortunately I can't test.

Jun 28 '23 18:06 casassg

Adding on to @singhniraj08's comment above, here is why beam pipeline args are considered part of the cache key:

beam_pipeline_args is part of BeamExecutorSpec on Beam components (e.g. the ExampleGen under discussion in this issue is a subclass of BeamComponent): https://github.com/tensorflow/tfx/blob/master/tfx/components/example_gen/csv_example_gen/component.py
executor_spec is part of the cache context when launching components: https://github.com/tensorflow/tfx/blob/master/tfx/orchestration/portable/launcher.py#L371
Therefore, beam_pipeline_args is taken into account when looking up a cached execution.
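Given that, a quick way to find out why two runs miss the cache is to compare their beam_pipeline_args directly. The helper below is a hypothetical debugging aid, not part of TFX, and the flag values are made up for illustration.

```python
def diff_beam_args(args_a, args_b):
    """Return the flags that appear in only one of the two argument lists."""
    only_a = sorted(set(args_a) - set(args_b))
    only_b = sorted(set(args_b) - set(args_a))
    return only_a, only_b


if __name__ == "__main__":
    # Illustrative values: the same image pushed under two different tags.
    run_1 = ["--runner=DataflowRunner",
             "--sdk_container_image=gcr.io/example/image:tag-a"]
    run_2 = ["--runner=DataflowRunner",
             "--sdk_container_image=gcr.io/example/image:tag-b"]
    only_1, only_2 = diff_beam_args(run_1, run_2)
    print("only in run 1:", only_1)
    print("only in run 2:", only_2)
```

If both lists come back empty, the arguments contain the same set of flags and the cache miss has to come from one of the other inputs to the cache key.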

Jun 29 '23 16:06 qingcan-google

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

Jul 08 '23 02:07 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for past 7 days.

Jul 15 '23 02:07 github-actions[bot]
