
Pipeline exception when pipeline_root points to s3

Open · lukesolo opened this issue 3 years ago · 2 comments

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Local Linux
  • TensorFlow version: 2.6.2
  • TFX Version: 1.3.3
  • Python version: 3.8.12
  • Python dependencies (from pip freeze output):
absl-py==0.12.0
apache-beam==2.33.0
argon2-cffi==21.1.0
astunparse==1.6.3
attrs==20.3.0
avro-python3==1.9.2.1
backcall==0.2.0
bleach==4.1.0
boto3==1.20.0
botocore==1.23.1
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.7
clang==5.0
click==7.1.2
crcmod==1.7
debugpy==1.5.1
decorator==5.1.0
defusedxml==0.7.1
dill==0.3.1.1
docker==4.4.4
docopt==0.6.2
entrypoints==0.3
fastavro==1.4.7
fasteners==0.16.3
flatbuffers==1.12
future==0.18.2
gast==0.4.0
google-api-core==1.31.4
google-api-python-client==1.12.8
google-apitools==0.5.31
google-auth==1.35.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-cloud-aiplatform==1.7.0
google-cloud-bigquery==2.30.1
google-cloud-bigtable==1.7.0
google-cloud-core==1.7.2
google-cloud-datastore==1.15.3
google-cloud-dlp==1.0.0
google-cloud-language==1.3.0
google-cloud-pubsub==1.7.0
google-cloud-recommendations-ai==0.2.0
google-cloud-spanner==1.19.1
google-cloud-storage==1.42.3
google-cloud-videointelligence==1.16.1
google-cloud-vision==1.0.0
google-crc32c==1.3.0
google-pasta==0.2.0
google-resumable-media==2.1.0
googleapis-common-protos==1.53.0
grpc-google-iam-v1==0.12.3
grpcio==1.41.1
grpcio-gcp==0.2.2
h5py==3.1.0
hdfs==2.6.0
httplib2==0.19.1
idna==3.3
importlib-resources==5.4.0
ipykernel==6.5.0
ipython==7.29.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
jedi==0.18.0
Jinja2==3.0.2
jmespath==0.10.0
joblib==0.14.1
jsonschema==4.2.1
jupyter-client==7.0.6
jupyter-core==4.9.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.2
keras==2.6.0
Keras-Preprocessing==1.1.2
keras-tuner==1.1.0
kt-legacy==1.0.4
kubernetes==12.0.1
Markdown==3.3.4
MarkupSafe==2.0.1
matplotlib-inline==0.1.3
mistune==0.8.4
ml-metadata==1.3.0
ml-pipelines-sdk==1.3.3
nbclient==0.5.5
nbconvert==6.2.0
nbformat==5.1.3
nest-asyncio==1.5.1
notebook==6.4.5
numpy==1.19.5
oauth2client==4.1.3
oauthlib==3.1.1
opt-einsum==3.3.0
orjson==3.6.4
packaging==20.9
pandas==1.3.4
pandocfilters==1.5.0
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
portpicker==1.5.0
prometheus-client==0.12.0
prompt-toolkit==3.0.22
proto-plus==1.19.7
protobuf==3.19.1
psutil==5.8.0
ptyprocess==0.7.0
pyarrow==2.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydot==1.4.2
Pygments==2.10.0
pymongo==3.12.1
pyparsing==2.4.7
pyrsistent==0.18.0
python-dateutil==2.8.2
pytz==2021.3
PyYAML==5.4.1
pyzmq==22.3.0
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
s3transfer==0.5.0
scipy==1.7.2
Send2Trash==1.8.0
six==1.15.0
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.6.2
tensorflow-data-validation==1.3.0
tensorflow-estimator==2.6.0
tensorflow-hub==0.12.0
tensorflow-io==0.21.0
tensorflow-io-gcs-filesystem==0.21.0
tensorflow-metadata==1.2.0
tensorflow-model-analysis==0.34.1
tensorflow-serving-api==2.6.2
tensorflow-transform==1.3.0
termcolor==1.1.0
terminado==0.12.1
testpath==0.5.0
tfx==1.3.3
tfx-bsl==1.3.0
tornado==6.1
traitlets==5.1.1
typing-extensions==3.7.4.3
uritemplate==3.0.1
urllib3==1.26.7
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.1
Werkzeug==2.0.2
widgetsnbextension==3.5.2
wrapt==1.12.1
zipp==3.6.0

Describe the current behavior

The pipeline run fails with ValueError: Unexpected type <class 'list'>. This error is raised while TFX is recovering from an earlier error and trying to log it:

tensorflow.python.framework.errors_impl.NotFoundError: Object s3://bucket/prefix/CsvExampleGen/.system/executor_execution/1/.temp/ does not exist
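S3 has no real directories: a "directory" is only a key prefix, and it exists only while at least one object key (possibly a zero-byte marker) starts with it. One plausible reading of the NotFoundError above, offered as an assumption rather than a confirmed diagnosis, is that nothing was ever written under the .temp/ prefix, so there was nothing to delete. The stdlib-only sketch below models that behavior; `FakeObjectStore` and `cleanup_tmp_dir` are hypothetical names and are not part of TFX, TensorFlow, or tensorflow-io.

```python
class NotFoundError(Exception):
    """Raised when no object exists under a path or prefix."""


class FakeObjectStore:
    """Hypothetical in-memory object store: a flat key -> bytes map.

    "Directories" are nothing but key prefixes, as on S3.
    """

    def __init__(self):
        self._objects = {}

    def put(self, key, data=b""):
        self._objects[key] = data

    @staticmethod
    def _as_prefix(path):
        return path if path.endswith("/") else path + "/"

    def exists(self, path):
        # A prefix "exists" only if some stored key starts with it.
        prefix = self._as_prefix(path)
        return path in self._objects or any(
            key.startswith(prefix) for key in self._objects
        )

    def rmtree(self, path):
        prefix = self._as_prefix(path)
        doomed = [key for key in self._objects if key.startswith(prefix)]
        if not doomed:
            # Nothing was ever written under this prefix: nothing to delete.
            raise NotFoundError(f"Object {prefix} does not exist")
        for key in doomed:
            del self._objects[key]


def cleanup_tmp_dir(store, path):
    """Delete a temporary prefix, treating 'already absent' as success."""
    try:
        store.rmtree(path)
    except NotFoundError:
        pass  # An empty or never-created prefix is already clean.
```

Treating a missing temporary prefix as already clean makes the cleanup idempotent, which is a common choice for object stores where an empty prefix cannot be told apart from a deleted one.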

Describe the expected behavior

Pipeline artifacts should be generated in the s3://bucket/prefix/ directory.

Standalone code to reproduce the issue

./in/test.csv

name1,name2
1,3
2,4

./main.py

from tfx import v1 as tfx
from tfx.components import CsvExampleGen
import tensorflow_io  # init s3:// protocol support


def create_pipeline():
    example_gen = CsvExampleGen(input_base="./in")
    return tfx.dsl.Pipeline(
        pipeline_name="test",
        pipeline_root="s3://bucket/prefix/",
        components=[example_gen],
        metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
            "./metadata.db"
        ),
    )


pipeline = create_pipeline()
tfx.orchestration.LocalDagRunner().run(pipeline)

Other info / logs

WARNING:root:Make sure that locally built Python SDK docker image has Python 3.8 interpreter.
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
Traceback (most recent call last):
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/dsl/io/plugins/tensorflow_gfile.py", line 97, in rmtree
    tf.io.gfile.rmtree(path)
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 677, in delete_recursively_v2
    _pywrap_file_io.DeleteRecursively(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.NotFoundError: Object s3://bucket/prefix/CsvExampleGen/.system/executor_execution/1/.temp/ does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/portable/launcher.py", line 466, in _clean_up_stateless_execution_info
    fileio.rmtree(execution_info.tmp_dir)
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/dsl/io/fileio.py", line 105, in rmtree
    _get_filesystem(path).rmtree(path)
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/dsl/io/plugins/tensorflow_gfile.py", line 99, in rmtree
    raise filesystem.NotFoundError() from e
tfx.dsl.io.filesystem.NotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 19, in <module>
    tfx.orchestration.LocalDagRunner().run(pipeline)
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/local/local_dag_runner.py", line 90, in run
    component_launcher.launch()
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/portable/launcher.py", line 554, in launch
    self._clean_up_stateless_execution_info(execution_info)
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/portable/launcher.py", line 472, in _clean_up_stateless_execution_info
    execution_info.tmp_dir, execution_info.to_proto())
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/portable/data_types.py", line 69, in to_proto
    execution_properties=data_types_utils.build_metadata_value_dict(
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/data_types_utils.py", line 84, in build_metadata_value_dict
    result[k] = set_metadata_value(value, v)
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/data_types_utils.py", line 235, in set_metadata_value
    set_parameter_value(parameter_value, value, set_schema=False)
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/data_types_utils.py", line 304, in set_parameter_value
    parameter_value.field_value.string_value = get_value_and_set_type(
  File "/home/lukesolo/tfx-s3/.venv/lib/python3.8/site-packages/tfx/orchestration/data_types_utils.py", line 287, in get_value_and_set_type
    raise ValueError('Unexpected type %s' % type(value))
ValueError: Unexpected type <class 'list'>
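The traceback chain above shows a general Python pitfall: the NotFoundError from rmtree is caught, but the handler then calls execution_info.to_proto() while logging, and the ValueError raised there escapes in place of the warning. Python keeps the original error only as `__context__` ("During handling of the above exception, another exception occurred"). A stdlib-only sketch of the pattern follows; every name in it (`NotFoundError`, `to_loggable`, `clean_up`, `clean_up_guarded`) is hypothetical, and it is not TFX's launcher code.

```python
import logging


class NotFoundError(Exception):
    """Stand-in for a filesystem 'not found' error."""


def to_loggable(value):
    # Strict serializer: like the one in the traceback, it rejects lists.
    if isinstance(value, (bool, int, float, str)):
        return value
    raise ValueError("Unexpected type %s" % type(value))


def remove_tmp_dir(path):
    raise NotFoundError(path)  # Simulate the missing S3 prefix.


def clean_up(tmp_dir, properties):
    try:
        remove_tmp_dir(tmp_dir)
    except NotFoundError:
        # If building the log message raises, that new exception escapes
        # and the NotFoundError survives only as its __context__.
        details = {k: to_loggable(v) for k, v in properties.items()}
        logging.warning("%s not found; properties: %s", tmp_dir, details)


def clean_up_guarded(tmp_dir, properties):
    try:
        remove_tmp_dir(tmp_dir)
    except NotFoundError:
        try:
            details = {k: to_loggable(v) for k, v in properties.items()}
        except ValueError:
            # repr() never fails, so logging cannot mask the original error.
            details = {k: repr(v) for k, v in properties.items()}
        logging.warning("%s not found; properties: %s", tmp_dir, details)
```

With a hypothetical list-valued property such as {"some_list_property": ["a", "b"]}, clean_up raises ValueError while clean_up_guarded only logs a warning.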

lukesolo, Nov 09 '21 13:11

@lukesolo This happens because your tensorflow_io import exists only in your pipeline definition file, not in the ExampleGen component, which runs in another process or container.

This is fixed in TFX 1.8.

ConverJens, Jun 03 '22 08:06

@ConverJens This is still happening in TFX 1.9.0. I'm using KubeflowDagRunner.

dvaldivia, Jul 18 '22 17:07