tfx
StatisticsGen - slice_functions in StatsOptions creates no slices
I'm running a TFX pipeline using TFX 0.24.0 with an ImportExampleGen and a StatisticsGen component.
I'm trying to configure slice_functions for the StatisticsGen component, but it has no effect. The overall dataset statistics are computed and work as expected.
My sample dataset has five features: name, ts, datetime, duration and uniq. The code for the StatisticsGen component looks like the following:
slice_name = slicing_util.get_feature_value_slicer(features={'name': None})
stats_options = tfdv.StatsOptions(slice_functions=[slice_name])
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'], stats_options=stats_options)
Loading and visualizing the stats artifact works as expected:
stats = tfdv.load_statistics(os.path.join(artifact.uri, "train", "stats_tfrecord"))
tfdv.visualize_statistics(stats)
However, trying to fetch a slice of the stats fails:
tfdv.get_slice_stats(stats, 'name')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-f5836823d287> in <module>
----> 1 tfdv.get_slice_stats(stats, 'name')
~/pyenv/lib/python3.7/site-packages/tensorflow_data_validation/utils/stats_util.py in get_slice_stats(statistics, slice_key)
315 result.datasets.add().CopyFrom(slice_stats)
316 return result
--> 317 raise ValueError('Invalid slice key.')
318
319
ValueError: Invalid slice key.
The issue seems to be that the additional slices are never generated: len(stats.datasets) yields 1 and len(stats.datasets[0].features) yields 5.
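A quick way to confirm this is to list the name of every dataset in the loaded proto, since each slice, if present, would show up as its own entry. The following is a minimal sketch, assuming only that the object exposes `datasets` whose items carry a `name` field, as the loaded statistics proto does. The helper `list_slice_keys` and the stand-in data are hypothetical and not part of TFDV:

```python
from types import SimpleNamespace


def list_slice_keys(statistics):
    """Return the name of every dataset (slice) stored in a statistics list."""
    return [dataset.name for dataset in statistics.datasets]


# Stand-in object with a placeholder name, for illustration only. With real
# data, pass the result of tfdv.load_statistics(...) instead.
example_stats = SimpleNamespace(datasets=[SimpleNamespace(name="overall")])
print(list_slice_keys(example_stats))
```

If slicing had worked, the list would contain more than one entry.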
Running the pipeline in KubeFlow or interactive_context yields the same result.
I also tried this in Colab, using https://www.tensorflow.org/tfx/tutorials/mlmd/mlmd_tutorial#top_of_page as a base and adding a slice function, with exactly the same result. Only the overall dataset is present; no slice datasets are added to the stats_tfrecord proto.
Any help with this would be appreciated!
This is a known issue. Currently, only certain fields of the tfdv.StatsOptions that you pass when creating the StatisticsGen component are honored.
https://github.com/tensorflow/tfx/blob/ba2f450973ee66721488809d359841bf144dd6ae/tfx/components/statistics_gen/component.py#L64
@brills I see, thanks for the clarification. Is a fix planned? Is there a timeline for it?
We have a plan to fix it. We will provide a slicing config, in addition to the slicing_fn in stats_options, that allows slicing by features. However, we don't have a timeline at this time.
@ConverJens,
slice_functions is deprecated; you can use experimental_slice_functions to generate slice keys. Please refer to tfdv.StatsOptions. Thank you!
Closing this due to inactivity. Please take a look at the answers provided above, and feel free to reopen and post your comments if you still have questions. Thank you!
Hi @singhniraj08 , I still have the same (or very similar) issue:
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import slicing_util
# Slice on region feature (i.e., every unique value of the feature).
slicer_region = slicing_util.get_feature_value_slicer(features={'region': None})
stats_options = tfdv.StatsOptions(experimental_slice_functions=[slicer_region])
tfdv_statistics = tfdv.generate_statistics_from_dataframe(data, stats_options)
>>> len(tfdv_statistics.datasets)
1
>>> tfdv.get_slice_stats(tfdv_statistics, 'region')
ValueError: Invalid slice key.
The visualization also doesn't show any information on slices.
In addition, I can see that StatsOptions.experimental_slice_functions is never actually called when calculating statistics (only during __init__, to validate input arguments).
What am I doing wrong?
@Ruwann,
I was able to create slices on a given feature and get the slice stats for a particular slice key. The tfdv.get_slice_stats function expects the dataset name as the slice key in order to return the stats of that particular dataset slice. You can refer to the example gist for reference. Thanks.
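To illustrate the point about slice keys: the lookup is by exact dataset name, so a bare feature name such as 'region' matches nothing. Below is a minimal sketch of that matching logic, with a more descriptive error than the 'Invalid slice key.' shown in the traceback above. The helper `find_slice` is hypothetical, the data is a stand-in, and the key 'region_EU' is a made-up example rather than a confirmed TFDV key format:

```python
from types import SimpleNamespace


def find_slice(statistics, slice_key):
    """Return the dataset whose name equals slice_key, or raise with hints."""
    for dataset in statistics.datasets:
        if dataset.name == slice_key:
            return dataset
    available = [dataset.name for dataset in statistics.datasets]
    raise ValueError(
        f"Invalid slice key {slice_key!r}; available keys: {available}")


# Stand-in data with hypothetical names, not real TFDV output.
example_stats = SimpleNamespace(datasets=[
    SimpleNamespace(name="overall"),
    SimpleNamespace(name="region_EU"),
])
print(find_slice(example_stats, "region_EU").name)
```

Calling `find_slice(example_stats, "region")` raises a ValueError that lists the keys actually present, which makes a mismatched key easier to spot.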
Thanks for the example gist, really helpful for me!
I was able to figure it out: it seems that generate_statistics_from_dataframe does not calculate slice statistics.
For future reference, the following workaround works for me:
from tempfile import NamedTemporaryFile
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import slicing_util
# Slice on region feature (i.e., every unique value of the feature).
slicer_region = slicing_util.get_feature_value_slicer(features={'region': None})
stats_options = tfdv.StatsOptions(experimental_slice_functions=[slicer_region])
with NamedTemporaryFile(suffix=".csv") as f:
    data.to_csv(f.name)
    tfdv_statistics = tfdv.generate_statistics_from_csv(
        data_location=f.name,
        stats_options=stats_options
    )
If you want, I can open up a separate issue for this?