TFDV fails every time on GC Dataflow job

Open AnoderPersona opened this issue 3 years ago • 3 comments

For some reason, starting a Dataflow job for tfdv.generate_statistics_from_csv against Google Cloud Storage does not work for me in this version: it fails on the fourth step every time. It does work with the previous TFX version, so I had to downgrade.

Version that appears to have the issue: tfx 1.6.0
Version that works for me: tfx 1.5.0

Code example:

from apache_beam.options.pipeline_options import (
 PipelineOptions, GoogleCloudOptions, StandardOptions)

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = '****'  # redacted
google_cloud_options.region = 'us-west1'
google_cloud_options.job_name = 'generando-stats'
google_cloud_options.staging_location = 'gs://***/staging'
google_cloud_options.temp_location = 'gs://***/tmp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

from apache_beam.options.pipeline_options import SetupOptions

setup_options = options.view_as(SetupOptions)
setup_options.extra_packages = ['/home/jupyter/tensorflow_data_validation-1.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl']

import tensorflow_data_validation as tfdv

# path to the data file
data_set_path = 'gs://***/csv/consumer_complaints_with_narrative.csv'

# path to the output file with the statistics
output_path = 'gs://***/stats.pb'
stats = tfdv.generate_statistics_from_csv(data_location=data_set_path,
                                          output_path=output_path,
                                          pipeline_options=options)
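
The wheel passed through extra_packages above is pinned to 1.5.0, so one thing worth ruling out is a mismatch between the TFDV version installed in the notebook and the one shipped to the Dataflow workers. Nothing in this thread confirms that as the cause. The following is only a minimal sketch of such a check; `wheel_version` and `versions_match` are hypothetical helper names, not part of TFDV or Beam.

```python
from pathlib import Path


def wheel_version(wheel_path: str) -> str:
    """Return the version field of a wheel filename.

    Wheel filenames follow name-version(-build)-python-abi-platform.whl,
    so the version is the second dash-separated field.
    """
    filename = Path(wheel_path).name
    parts = filename.split("-")
    if not filename.endswith(".whl") or len(parts) < 5:
        raise ValueError(f"not a wheel filename: {filename}")
    return parts[1]


def versions_match(wheel_path: str, installed_version: str) -> bool:
    """True when the wheel's version equals the locally installed version."""
    return wheel_version(wheel_path) == installed_version


WHEEL = ("/home/jupyter/tensorflow_data_validation-1.5.0-cp37-cp37m-"
         "manylinux_2_12_x86_64.manylinux2010_x86_64.whl")

# In the notebook, the installed version would come from the package itself,
# e.g. versions_match(WHEEL, tfdv.__version__) (assumed attribute).
print(wheel_version(WHEEL))
```

If the two versions differ, either pointing extra_packages at a wheel that matches the notebook or pinning the notebook to the wheel's version would remove that variable.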

Step at which it always fails: image

Error according to dataflow:

TypeError: 'int' object is not iterable 

This was run in a Google Cloud Dataflow Jupyter notebook. Hope this helps someone.

AnoderPersona, Feb 04 '22 18:02

Thanks for opening this issue. Were you using a released version of TFDV 1.6.0, or did you build TFDV from source?

caveness, Feb 10 '22 22:02

The released one. I actually didn't even notice it had updated until two days later haha

AnoderPersona, Feb 10 '22 22:02

Thanks for the info -- Could you provide an excerpt of your logs containing the error? It would be very useful to know what stage this is happening at.
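
One possible way to collect such an excerpt from the notebook side is sketched below, under stated assumptions. `capture_traceback` is a hypothetical helper, and `_stand_in` only imitates the reported error; it is not the TFDV call. Whether the worker-side stack appears in the local exception depends on the runner, so the Dataflow worker logs may still be needed.

```python
import traceback


def capture_traceback(fn, *args, **kwargs):
    """Call fn; return (result, None) on success or (None, traceback text) on failure."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # format_exc() returns the full stack trace of the active exception as text.
        return None, traceback.format_exc()


def _stand_in():
    # Placeholder that raises the same kind of error as in the report.
    return list(3)


_, log_excerpt = capture_traceback(_stand_in)
print(log_excerpt)
```

In the notebook, the stand-in would be replaced by the real call, for example `capture_traceback(tfdv.generate_statistics_from_csv, data_location=data_set_path, output_path=output_path, pipeline_options=options)`.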

caveness, Feb 16 '22 21:02

Closing this issue in light of the lack of further information, but please reopen if this is still an issue and you can provide logs or more detail about where the error you saw is happening. Thanks!

caveness, Aug 18 '22 18:08