
Cache skipped with same execution properties

Open casassg opened this issue 2 years ago • 2 comments

System information

  • Have I specified the code to reproduce the issue (Yes, No): No
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Kubeflow Pipelines GKE deployment - KF 1.5 version
  • TensorFlow version: 2.5.1
  • TFX Version: 1.2.1
  • Python version: 3.7
  • Python dependencies (from pip freeze output):

Describe the current behavior

When running an ExampleGen with identical execution properties using DataflowRunner, the cache gets skipped.

Describe the expected behavior

The cache gets hit.

Standalone code to reproduce the issue

No code, but we do have the execution details: https://gist.github.com/casassg/cdc5e7ef216ceac90f49adc0b7721c11

Name of your Organization (Optional) Twitter

Other info / logs

It is important to note that we do see a difference in beam_pipeline_args, but we are not clear whether that gets used for cache validation or not.

casassg avatar Sep 23 '22 23:09 casassg

Does the query string change?

Note that beam_pipeline_args differs because we built a container image and pushed it with 2 tags. However, I've been digging through the TFX code that checks for executions, and I have not been able to find a place where it would break the cache.

Could you try rerunning with exactly the same beam_pipeline_args to see if it caches?

rcrowe-google avatar Sep 24 '22 00:09 rcrowe-google

The query string stays the same; I had to scrub it from the gist for privacy reasons, but I can assure you it's the same. Also, running with the same beam_pipeline_args it does cache (I cloned the run in the UI to validate this).

I'm 90% sure this is due to beam_pipeline_args changing, but this also seems quite non-intuitive. Why does the cache get invalidated for a class property like this? Ideally it should only modify how the component gets executed; if the inputs/exec_properties are the same, it should hit the cache when caching is enabled. That said, I have not been able to figure out the logic this is hitting.

casassg avatar Sep 27 '22 23:09 casassg

Having the same issue. context.run(example_gen, enable_cache=True) in a Jupyter notebook is not using the cache, whereas the following execution does use the cache (executed as IntelliJ Python code).

metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(METADATA_PATH.as_posix())
pipeline = tfx.dsl.Pipeline(
    pipeline_name=PIPELINE_NAME,
    pipeline_root=PIPELINE_ROOT.as_posix(),
    components=components,
    enable_cache=True,
    metadata_connection_config=metadata_connection_config
)
result = tfx.orchestration.LocalDagRunner().run(pipeline)

Is there any way to force StatisticsGen to use a given version of ExampleGen, i.e. to use a previous output?

ismailsimsek avatar Jun 20 '23 10:06 ismailsimsek

@casassg, @ismailsimsek ,

The cache key is generated by applying SHA-256 hashing function on:

  • Serialized pipeline info.
  • Serialized node_info of the PipelineNode.
  • Serialized executor spec.
  • Serialized input artifacts, if any.
  • Serialized output artifacts, if any (the uri is stripped during hashing).
  • Serialized parameters, if any.
  • Serialized module file content, if a module file is present in the parameters.

Changing any of the above invalidates the cache. Please make sure all of the above are constant; if the cache still gets skipped after that, let us know if the issue persists. Thank you!
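The key construction described above can be sketched roughly as follows. This is an illustrative stand-in, not TFX's actual implementation: the real code hashes serialized protobufs, while this sketch uses JSON dicts, and all field names here are assumptions.

```python
import hashlib
import json


def cache_key(pipeline_info, node_info, executor_spec,
              input_artifacts, output_artifacts, parameters,
              module_file_content=None):
    """Toy cache key: SHA-256 over the serialized fields listed above."""
    # Strip uris from output artifacts, mirroring the step described above.
    outputs = [{k: v for k, v in a.items() if k != "uri"}
               for a in output_artifacts]
    payload = json.dumps(
        {
            "pipeline_info": pipeline_info,
            "node_info": node_info,
            "executor_spec": executor_spec,
            "inputs": input_artifacts,
            "outputs": outputs,
            "parameters": parameters,
            "module_file": module_file_content,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under this model, a change to any single field (including the executor spec) yields a different digest, so the prior execution's key no longer matches and the cache is skipped; output uris alone do not affect the key because they are stripped before hashing.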

singhniraj08 avatar Jun 28 '23 05:06 singhniraj08

Note that the issue is beam pipeline args being part of the cache key (those are execution configuration for Beam). Also, I no longer work with TFX, so unfortunately I can't test.

casassg avatar Jun 28 '23 18:06 casassg

Adding on to @singhniraj08's comment above, here is why beam_pipeline_args is considered part of the cache key:

  1. beam_pipeline_args is part of BeamExecutorSpec on BeamComponents (e.g. the ExampleGen under discussion in this issue is a subclass of BeamComponent): https://github.com/tensorflow/tfx/blob/master/tfx/components/example_gen/csv_example_gen/component.py

  2. executor_spec is part of the cache context when launching components: https://github.com/tensorflow/tfx/blob/master/tfx/orchestration/portable/launcher.py#L371

  3. Therefore, beam_pipeline_args is taken into account when deciding whether a cached execution can be reused.
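The effect of the three steps above can be demonstrated with a toy version of the key computation (hypothetical serialization; TFX's real code hashes serialized protos, and the field names below are assumptions):

```python
import hashlib
import json


def key_from_executor_spec(executor_spec):
    """Toy cache key derived from the serialized executor spec only."""
    payload = json.dumps(executor_spec, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two specs identical except for beam_pipeline_args, e.g. the same
# container image pushed under two different tags.
spec_a = {"class_path": "CsvExampleGen.Executor",
          "beam_pipeline_args": ["--sdk_container_image=repo/img:tag1"]}
spec_b = {"class_path": "CsvExampleGen.Executor",
          "beam_pipeline_args": ["--sdk_container_image=repo/img:tag2"]}

# Different beam args -> different serialized executor spec -> different
# cache key, so the previous execution is not reused.
assert key_from_executor_spec(spec_a) != key_from_executor_spec(spec_b)
```

This matches the behavior reported in the thread: rerunning with exactly the same beam_pipeline_args hits the cache, while changing only the image tag inside them does not.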

qingcan-google avatar Jun 29 '23 16:06 qingcan-google

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Jul 08 '23 02:07 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for the past 7 days.

github-actions[bot] avatar Jul 15 '23 02:07 github-actions[bot]
