generate_statistics_from_csv is very slow for a large dataset on a single server
Hi, following the TFX examples, I pass pipeline_options to generate_statistics_from_csv with --direct_num_workers=16 set, like this:
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers=16'])
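For context, this is roughly how the options are wired into the call (a minimal sketch; the CSV path is a placeholder for my actual dataset):
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

CSV_FILE_PATH = ''  # placeholder for the large CSV dataset

pipeline_options = PipelineOptions(['--direct_num_workers=16'])
stats = tfdv.generate_statistics_from_csv(
    data_location=CSV_FILE_PATH,
    pipeline_options=pipeline_options)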
It seems that this option does not speed up the API: when I set direct_num_workers=1, the runtime is about the same as with 16 workers:
# direct_num_workers=1
python prep.py 99.27s user 5.84s system 99% cpu 1:45.67 total
# direct_num_workers=16
python prep.py 101.92s user 5.22s system 98% cpu 1:48.44 total
Could someone help me?
Another option is to try using generate_statistics_from_dataframe if you can load your dataset as a pandas dataframe.
import tensorflow_data_validation as tfdv
import pandas as pd
CSV_FILE_PATH = ''
df = pd.read_csv(CSV_FILE_PATH)
stats = tfdv.generate_statistics_from_dataframe(df)
Hi Yajunwang,
When executing your pipeline locally, the default values for the properties in PipelineOptions are generally sufficient, and the direct runner executes on a single machine: https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
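If you need to scale beyond a single machine, one option is to run the same statistics generation on Dataflow. A minimal sketch, assuming you have a GCP project and a GCS bucket (the project, region, bucket, and paths below are placeholders, not values from this thread):
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder GCP settings; replace with your own project, region, and bucket.
pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
])
stats = tfdv.generate_statistics_from_csv(
    data_location='gs://my-bucket/data/*.csv',
    output_path='gs://my-bucket/stats/stats.tfrecord',
    pipeline_options=pipeline_options)
Note that on Dataflow you may also need to make the TFDV package available on the workers; see the TFDV documentation for details.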
This option does not seem to have any effect. Please refer to this gist: https://gist.github.com/yajunwong/f317c565f375125fd3ec2963967ba164
Another option is to try using generate_statistics_from_dataframe if you can load your dataset as a pandas dataframe.
I tried this API, but it reports an error; please refer to this comment: https://github.com/tensorflow/data-validation/issues/98#issuecomment-570701242