[DataflowRuntimeException] ImportError: No module named tfdv.statistics.stats_impl
Context
When running tfdv.generate_statistics_from_tfrecord on Dataflow, the job is submitted to the cluster successfully, but it then fails with the following error during the job unpickling phase on the Dataflow worker:
ImportError: No module named tensorflow_data_validation.statistics.stats_impl
Error trace
---------------------------------------------------------------------------
DataflowRuntimeException Traceback (most recent call last)
<ipython-input-23-8f1147effd88> in <module>()
16 # for more options about stats, run `?tfdv.generate_statistics_from_tfrecord`
17 tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH,
---> 18 pipeline_options=pipeline_options)
/Users/romain/dev/venv/lib/python2.7/site-packages/tensorflow_data_validation/utils/stats_gen_lib.pyc in generate_statistics_from_tfrecord(data_location, output_path, stats_options, pipeline_options)
86 shard_name_template='',
87 coder=beam.coders.ProtoCoder(
---> 88 statistics_pb2.DatasetFeatureStatisticsList)))
89 return load_statistics(output_path)
90
/Users/romain/dev/venv/lib/python2.7/site-packages/apache_beam/pipeline.pyc in __exit__(self, exc_type, exc_val, exc_tb)
421 def __exit__(self, exc_type, exc_val, exc_tb):
422 if not exc_type:
--> 423 self.run().wait_until_finish()
424
425 def visit(self, visitor):
/Users/romain/dev/venv/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in wait_until_finish(self, duration)
1164 raise DataflowRuntimeException(
1165 'Dataflow pipeline failed. State: %s, Error:\n%s' %
-> 1166 (self.state, getattr(self._runner, 'last_error_msg', None)), self)
1167 return self.state
1168
DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 130, in execute
test_shuffle_sink=self._test_shuffle_sink)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 104, in create_operation
is_streaming=False)
File "apache_beam/runners/worker/operations.py", line 636, in apache_beam.runners.worker.operations.create_operation
op = create_pgbk_op(name_context, spec, counter_factory, state_sampler)
File "apache_beam/runners/worker/operations.py", line 482, in apache_beam.runners.worker.operations.create_pgbk_op
return PGBKCVOperation(step_name, spec, counter_factory, state_sampler)
File "apache_beam/runners/worker/operations.py", line 538, in apache_beam.runners.worker.operations.PGBKCVOperation.__init__
fn, args, kwargs = pickler.loads(self.spec.combine_fn)[:3]
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 246, in loads
return dill.loads(s)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 316, in loads
return load(file, ignore)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 304, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
klass = self.find_class(module, name)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 465, in find_class
return StockUnpickler.find_class(self, module, name)
File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
__import__(module)
ImportError: No module named tensorflow_data_validation.statistics.stats_impl
What code did I run?
!pip install -U tensorflow \
tensorflow-data-validation \
apache-beam[gcp]
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions
# Create and set your PipelineOptions.
pipeline_options = PipelineOptions()
# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH,
pipeline_options=pipeline_options)
Pip trace
Requirement already up-to-date: tensorflow in /Users/romain/dev/venv/lib/python2.7/site-packages (1.12.0)
Requirement already up-to-date: tensorflow-data-validation in /Users/romain/dev/venv/lib/python2.7/site-packages (0.11.0)
Requirement already up-to-date: apache-beam[gcp] in /Users/romain/dev/venv/lib/python2.7/site-packages (2.8.0)
Requirement already satisfied, skipping upgrade: enum34>=1.1.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.1.6)
Requirement already satisfied, skipping upgrade: keras-preprocessing>=1.0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.5)
Requirement already satisfied, skipping upgrade: wheel in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.31.1)
Requirement already satisfied, skipping upgrade: astor>=0.6.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.7.1)
Requirement already satisfied, skipping upgrade: backports.weakref>=1.0rc1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.post1)
Requirement already satisfied, skipping upgrade: mock>=2.0.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (2.0.0)
Requirement already satisfied, skipping upgrade: tensorboard<1.13.0,>=1.12.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.12.0)
Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.1.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.6.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (3.6.1)
Requirement already satisfied, skipping upgrade: gast>=0.2.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.2.0)
Requirement already satisfied, skipping upgrade: absl-py>=0.1.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (0.3.0)
Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.13.0)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.10.0)
Requirement already satisfied, skipping upgrade: keras-applications>=1.0.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.0.6)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow) (1.14.0)
Requirement already satisfied, skipping upgrade: IPython<6,>=5.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (5.7.0)
Requirement already satisfied, skipping upgrade: tensorflow-metadata<0.10,>=0.9 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.9.0)
Requirement already satisfied, skipping upgrade: tensorflow-transform<0.12,>=0.11 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.11.0)
Requirement already satisfied, skipping upgrade: pandas<1,>=0.18 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-data-validation) (0.22.0)
Requirement already satisfied, skipping upgrade: oauth2client<5,>=2.0.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (4.1.3)
Requirement already satisfied, skipping upgrade: dill<=0.2.8.2,>=0.2.6 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.2.8.2)
Requirement already satisfied, skipping upgrade: pydot<1.3,>=1.2.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.2.4)
Requirement already satisfied, skipping upgrade: pyyaml<4.0.0,>=3.12 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.12)
Requirement already satisfied, skipping upgrade: pyvcf<0.7.0,>=0.6.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.6.8)
Requirement already satisfied, skipping upgrade: typing<3.7.0,>=3.6.0; python_version < "3.5.0" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.6.4)
Requirement already satisfied, skipping upgrade: avro<2.0.0,>=1.8.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.8.2)
Requirement already satisfied, skipping upgrade: future<1.0.0,>=0.16.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.16.0)
Requirement already satisfied, skipping upgrade: fastavro<0.22,>=0.21.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.21.13)
Requirement already satisfied, skipping upgrade: crcmod<2.0,>=1.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (1.7)
Requirement already satisfied, skipping upgrade: httplib2<=0.11.3,>=0.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.11.3)
Requirement already satisfied, skipping upgrade: futures<4.0.0,>=3.1.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (3.2.0)
Requirement already satisfied, skipping upgrade: hdfs<3.0.0,>=2.1.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (2.1.0)
Requirement already satisfied, skipping upgrade: pytz<=2018.4,>=2018.3 in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (2018.4)
Requirement already satisfied, skipping upgrade: google-apitools<=0.5.20,>=0.5.18; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.5.20)
Requirement already satisfied, skipping upgrade: proto-google-cloud-pubsub-v1==0.15.4; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.15.4)
Requirement already satisfied, skipping upgrade: googledatastore==7.0.1; python_version < "3.0" and extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (7.0.1)
Requirement already satisfied, skipping upgrade: google-cloud-bigquery==0.25.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.25.0)
Requirement already satisfied, skipping upgrade: google-cloud-pubsub==0.26.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.26.0)
Requirement already satisfied, skipping upgrade: proto-google-cloud-datastore-v1<=0.90.4,>=0.90.0; extra == "gcp" in /Users/romain/dev/venv/lib/python2.7/site-packages (from apache-beam[gcp]) (0.90.4)
Requirement already satisfied, skipping upgrade: funcsigs>=1; python_version < "3.3" in /Users/romain/dev/venv/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow) (1.0.2)
Requirement already satisfied, skipping upgrade: pbr>=0.11 in /Users/romain/dev/venv/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow) (1.10.0)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.10 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorboard<1.13.0,>=1.12.0->tensorflow) (0.14.1)
Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorboard<1.13.0,>=1.12.0->tensorflow) (2.6.11)
Requirement already satisfied, skipping upgrade: setuptools in /Users/romain/dev/venv/lib/python2.7/site-packages (from protobuf>=3.6.1->tensorflow) (39.1.0)
Requirement already satisfied, skipping upgrade: h5py in /Users/romain/dev/venv/lib/python2.7/site-packages (from keras-applications>=1.0.6->tensorflow) (2.8.0)
Requirement already satisfied, skipping upgrade: simplegeneric>0.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.8.1)
Requirement already satisfied, skipping upgrade: pygments in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (2.2.0)
Requirement already satisfied, skipping upgrade: backports.shutil-get-terminal-size; python_version == "2.7" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (1.0.0)
Requirement already satisfied, skipping upgrade: pexpect; sys_platform != "win32" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.6.0)
Requirement already satisfied, skipping upgrade: prompt-toolkit<2.0.0,>=1.0.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (1.0.15)
Requirement already satisfied, skipping upgrade: decorator in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.3.0)
Requirement already satisfied, skipping upgrade: pickleshare in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.7.4)
Requirement already satisfied, skipping upgrade: appnope; sys_platform == "darwin" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (0.1.0)
Requirement already satisfied, skipping upgrade: traitlets>=4.2 in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (4.3.2)
Requirement already satisfied, skipping upgrade: pathlib2; python_version == "2.7" or python_version == "3.3" in /Users/romain/dev/venv/lib/python2.7/site-packages (from IPython<6,>=5.0->tensorflow-data-validation) (2.3.2)
Requirement already satisfied, skipping upgrade: googleapis-common-protos in /Users/romain/dev/venv/lib/python2.7/site-packages (from tensorflow-metadata<0.10,>=0.9->tensorflow-data-validation) (1.5.3)
Requirement already satisfied, skipping upgrade: python-dateutil in /Users/romain/dev/venv/lib/python2.7/site-packages (from pandas<1,>=0.18->tensorflow-data-validation) (2.7.3)
Requirement already satisfied, skipping upgrade: rsa>=3.1.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (3.4.2)
Requirement already satisfied, skipping upgrade: pyasn1>=0.1.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (0.1.9)
Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from oauth2client<5,>=2.0.1->apache-beam[gcp]) (0.0.8)
Requirement already satisfied, skipping upgrade: pyparsing>=2.1.4 in /Users/romain/dev/venv/lib/python2.7/site-packages (from pydot<1.3,>=1.2.0->apache-beam[gcp]) (2.1.10)
Requirement already satisfied, skipping upgrade: docopt in /Users/romain/dev/venv/lib/python2.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]) (0.6.2)
Requirement already satisfied, skipping upgrade: requests>=2.7.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]) (2.11.1)
Requirement already satisfied, skipping upgrade: fasteners>=0.14 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-apitools<=0.5.20,>=0.5.18; extra == "gcp"->apache-beam[gcp]) (0.14.1)
Requirement already satisfied, skipping upgrade: google-cloud-core<0.26dev,>=0.25.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (0.25.0)
Requirement already satisfied, skipping upgrade: gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.15.4)
Requirement already satisfied, skipping upgrade: ptyprocess>=0.5 in /Users/romain/dev/venv/lib/python2.7/site-packages (from pexpect; sys_platform != "win32"->IPython<6,>=5.0->tensorflow-data-validation) (0.5.2)
Requirement already satisfied, skipping upgrade: wcwidth in /Users/romain/dev/venv/lib/python2.7/site-packages (from prompt-toolkit<2.0.0,>=1.0.4->IPython<6,>=5.0->tensorflow-data-validation) (0.1.7)
Requirement already satisfied, skipping upgrade: ipython-genutils in /Users/romain/dev/venv/lib/python2.7/site-packages (from traitlets>=4.2->IPython<6,>=5.0->tensorflow-data-validation) (0.2.0)
Requirement already satisfied, skipping upgrade: scandir; python_version < "3.5" in /Users/romain/dev/venv/lib/python2.7/site-packages (from pathlib2; python_version == "2.7" or python_version == "3.3"->IPython<6,>=5.0->tensorflow-data-validation) (1.7)
Requirement already satisfied, skipping upgrade: monotonic>=0.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from fasteners>=0.14->google-apitools<=0.5.20,>=0.5.18; extra == "gcp"->apache-beam[gcp]) (1.5)
Requirement already satisfied, skipping upgrade: google-auth-httplib2 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (0.0.3)
Requirement already satisfied, skipping upgrade: google-auth<2.0.0dev,>=0.4.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (1.1.1)
Requirement already satisfied, skipping upgrade: grpc-google-iam-v1<0.12dev,>=0.11.1 in /Users/romain/dev/venv/lib/python2.7/site-packages (from gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.11.4)
Requirement already satisfied, skipping upgrade: google-gax<0.16dev,>=0.15.7 in /Users/romain/dev/venv/lib/python2.7/site-packages (from gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (0.15.16)
Requirement already satisfied, skipping upgrade: cachetools>=2.0.0 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-auth<2.0.0dev,>=0.4.0->google-cloud-core<0.26dev,>=0.25.0->google-cloud-bigquery==0.25.0; extra == "gcp"->apache-beam[gcp]) (2.0.1)
Requirement already satisfied, skipping upgrade: ply==3.8 in /Users/romain/dev/venv/lib/python2.7/site-packages (from google-gax<0.16dev,>=0.15.7->gapic-google-cloud-pubsub-v1<0.16dev,>=0.15.0->google-cloud-pubsub==0.26.0; extra == "gcp"->apache-beam[gcp]) (3.8)
This error occurs because TFDV is not installed on the Dataflow workers. Can you try the following:
!pip install -U tensorflow \
tensorflow-data-validation \
apache-beam[gcp]
# Download TFDV wheel file to be provided to dataflow workers.
# This saves the wheel file to the current directory.
!pip download tensorflow_data_validation --no-deps --platform manylinux1_x86_64 --only-binary=:all:
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions
# Create and set your PipelineOptions.
pipeline_options = PipelineOptions()
# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
setup_options = pipeline_options.view_as(SetupOptions)
# Path of the wheel file we downloaded.
setup_options.extra_packages = ['tensorflow_data_validation-0.11.0-cp27-cp27mu-manylinux1_x86_64.whl']
# Make sure to specify a GCS output file path to write stats to.
tfdv.generate_statistics_from_tfrecord(TFRECORDS_PATH,
output_path=GCS_STATS_FILE_OUTPUT_PATH,
pipeline_options=pipeline_options)
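For anyone wondering why a missing package on the workers shows up as an ImportError during unpickling: the traceback ends in pickle's find_class calling __import__, which suggests the serialized combine function refers to its module by name rather than carrying the code. Below is a minimal local sketch of that failure mode using only the standard library. It assumes Python 3, and the module name fake_stats_impl is a hypothetical stand-in for tensorflow_data_validation.statistics.stats_impl; it does not use Beam or dill.

```python
import importlib
import os
import pickle
import sys
import tempfile


def demo_missing_module_unpickle():
    """Pickle a function by reference, then unpickle it where its module is absent."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical stand-in for a package installed only on the submitter's machine.
        module_path = os.path.join(tmp_dir, "fake_stats_impl.py")
        with open(module_path, "w") as f:
            f.write("def combine(values):\n    return sum(values)\n")

        sys.path.insert(0, tmp_dir)
        try:
            module = importlib.import_module("fake_stats_impl")
            # The payload stores the names "fake_stats_impl" and "combine",
            # not the function's code.
            payload = pickle.dumps(module.combine)
        finally:
            # Simulate the worker: the module is no longer importable.
            sys.path.remove(tmp_dir)
            sys.modules.pop("fake_stats_impl", None)

    importlib.invalidate_caches()
    try:
        pickle.loads(payload)
    except ImportError as err:
        return type(err).__name__, str(err)
    return None


if __name__ == "__main__":
    print(demo_missing_module_unpickle())
```

Loading the payload fails because the unpickler has to import the module first, which is the same step that fails on the Dataflow worker in the trace above.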
Thanks a lot for coming back to me quickly @paulgc. Will give it a try and report on progress.
Seems like it might be worth adding this ^ in the get_started.md#running-on-google-cloud doc.
Thanks again for your help @paulgc, it worked. FYI, since I'm not running my code on a Linux box, I needed to be more specific about the target environment for it to work:
pip download tensorflow-data-validation \
--platform manylinux1_x86_64 \
--python-version 27 \
--implementation cp \
--abi cp27mu \
--no-deps \
--only-binary=:all:
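Since the wheel filename depends on the version and ABI tags that pip resolves, hardcoding it in extra_packages can break after an upgrade. A minimal sketch of looking the file up instead, assuming the wheel was downloaded into a single directory; the helper name find_tfdv_wheel is hypothetical and not part of TFDV or Beam:

```python
import glob
import os


def find_tfdv_wheel(directory="."):
    """Return the path of the single TFDV wheel found in `directory`."""
    pattern = os.path.join(directory, "tensorflow_data_validation-*.whl")
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        # Refuse to guess when no wheel, or several versions, are present.
        raise RuntimeError(
            "Expected exactly one TFDV wheel in %r, found %d" % (directory, len(matches))
        )
    return matches[0]


# Usage with the options object from the snippet above:
# setup_options.extra_packages = [find_tfdv_wheel()]
```

Failing loudly when zero or several wheels match is a design choice here, so that a stale wheel is not shipped to the workers by accident.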
At first glance, downloading the wheel on the job submitter's machine seems less robust than alternatives such as passing a requirements_file or using a "harness container image". Which approach do you think is the most sustainable?
@aaltay @katsiapis
I ran into the same issue, where the worker nodes are unable to import TFDV, and have had success with the following modifications to the Running on Google Cloud documentation.
...
debug_options = options.view_as(DebugOptions)
debug_options.experiments = ['beam_fn_api']
worker_options = options.view_as(WorkerOptions)
worker_options.worker_harness_container_image = 'gcr.io/ml-sketchbook/test-brianm-tfdv:0'
...
Where the worker harness container image is:
FROM gcr.io/cloud-dataflow/v1beta3/python-fnapi:2.8.0
RUN pip install tensorflow-data-validation
So to summarize:
- Is using worker_harness_container_image an okay approach?
- As currently written, the Running on Google Cloud documentation is broken. If we can reach a good/preferred solution here, it should be updated.
Do you set the --save_main_session pipeline option on your pipeline? If not, could you try that, please?
Unless you are explicitly alpha-testing the FnApi execution mode on Dataflow, you should not set the beam_fn_api experiment. Please also don't hardcode the container image; the Beam SDK should set those flags for you.
We're currently using the approach suggested by @paulgc above: https://github.com/tensorflow/data-validation/issues/38#issuecomment-439255405
We did not have any real reason to use the FnApi; I was simply exploring options that didn't involve downloading the TFDV wheel on our side. We've had success with --setup_file as well.
If we do have a need for FnApi, I'll definitely give --save_main_session a look.
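For reference, a rough sketch of how the dependency options mentioned in this thread (--setup_file, --extra_package, --save_main_session) can be collected as command-line style flags. The helper build_dataflow_flags and all example values are hypothetical; only the flag names come from Beam's pipeline options, and the sketch does not submit a job.

```python
def build_dataflow_flags(project, job_name, staging_location, temp_location,
                         setup_file=None, extra_packages=(),
                         save_main_session=False):
    """Build a list of Beam pipeline flags for a Dataflow run."""
    flags = [
        "--runner=DataflowRunner",
        "--project=" + project,
        "--job_name=" + job_name,
        "--staging_location=" + staging_location,
        "--temp_location=" + temp_location,
    ]
    if setup_file:
        # A setup.py that declares the worker dependencies (e.g. TFDV).
        flags.append("--setup_file=" + setup_file)
    for package in extra_packages:
        # Local wheel or sdist files to stage for the workers.
        flags.append("--extra_package=" + package)
    if save_main_session:
        flags.append("--save_main_session")
    return flags


# Usage (requires apache-beam; all values below are placeholders):
# from apache_beam.options.pipeline_options import PipelineOptions
# pipeline_options = PipelineOptions(flags=build_dataflow_flags(
#     "my-project", "tfdv-stats", "gs://my-bucket/staging", "gs://my-bucket/tmp",
#     setup_file="./setup.py"))
```

Keeping the flags in one place makes it easier to switch between the wheel-based and setup-file-based approaches without touching the rest of the notebook.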
@yonromai
If you're looking for a sustainable solution, I would suggest the harness container image approach rather than a requirements file. However, per the TensorFlow documentation, to run TFDV on Google Cloud the TFDV wheel file must be downloaded and provided to the Dataflow workers. Here is the reference link to that documentation
Could you please close this issue if it was resolved by the solution @paulgc provided above?
Thank You!
Hi, @yonromai
Closing this issue due to inactivity. Please feel free to reopen if this still exists. Thank you!