
StatisticsGen - slice_functions in StatsOptions creates no slices

Open ConverJens opened this issue 4 years ago • 4 comments

I'm running a TFX pipeline using TFX 0.24.0 with an ImportExampleGen and StatisticsGen component.

I'm trying to configure slice_functions for the StatisticsGen component, but it has no effect. The overall dataset statistics are computed and work as expected.

My sample dataset has five features: name, ts, datetime, duration and uniq. The code for the StatisticsGen component looks like the following:

slice_name = slicing_util.get_feature_value_slicer(features={'name': None})
stats_options = tfdv.StatsOptions(slice_functions=[slice_name])
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'], stats_options=stats_options)

Loading and visualizing the stats artifact works as expected:

stats = tfdv.load_statistics(os.path.join(artifact.uri, "train", "stats_tfrecord"))
tfdv.visualize_statistics(stats)

However, trying to fetch a slice of the stats fails:

tfdv.get_slice_stats(stats, 'name')

---------------------------------------------------------------------------

ValueError Traceback (most recent call last)
<ipython-input-21-f5836823d287> in <module>
----> 1 tfdv.get_slice_stats(stats, 'name')
~/pyenv/lib/python3.7/site-packages/tensorflow_data_validation/utils/stats_util.py in get_slice_stats(statistics, slice_key)
315 result.datasets.add().CopyFrom(slice_stats)
316 return result
--> 317 raise ValueError('Invalid slice key.')
318
319
ValueError: Invalid slice key.

The issue seems to be that the additional slices are never generated: len(stats.datasets) yields 1 and len(stats.datasets[0].features) yields 5.
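As a minimal sketch of the check described above, the helper below lists every dataset in a statistics list together with its example and feature counts. It reads only the name, num_examples and features fields of each entry in stats.datasets, so it should work on the proto returned by tfdv.load_statistics. The helper name summarize_datasets and the stand-in object are hypothetical, and "All Examples" is assumed to be the name TFDV gives the unsliced dataset.

```python
from types import SimpleNamespace


def summarize_datasets(stats):
    """Return (name, num_examples, num_features) for each dataset in a stats list."""
    return [(d.name, d.num_examples, len(d.features)) for d in stats.datasets]


# Stand-in for a DatasetFeatureStatisticsList with only the overall dataset,
# mirroring the situation in this report (1 dataset, 5 features).
# The dataset name "All Examples" is an assumption about TFDV's default.
fake_stats = SimpleNamespace(
    datasets=[
        SimpleNamespace(name="All Examples", num_examples=100, features=[None] * 5)
    ]
)

for name, num_examples, num_features in summarize_datasets(fake_stats):
    print(name, num_examples, num_features)
```

If slicing had worked, this list would contain one extra entry per slice in addition to the overall dataset.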

Running the pipeline in KubeFlow or interactive_context yields the same result.

I also tried this in Colab, using https://www.tensorflow.org/tfx/tutorials/mlmd/mlmd_tutorial#top_of_page as a base and adding a slice function, with exactly the same result. Only the overall dataset is present; no slice datasets are added to the stats_tfrecord proto.

Any help with this would be appreciated!

ConverJens avatar Oct 12 '20 12:10 ConverJens

This is a known issue. Only certain fields of the tfdv.StatsOptions that you pass when creating the StatisticsGen component are currently honored.

https://github.com/tensorflow/tfx/blob/ba2f450973ee66721488809d359841bf144dd6ae/tfx/components/statistics_gen/component.py#L64

brills avatar Oct 14 '20 16:10 brills

@brills I see, thanks for the clarification. Is a fix planned? Is there a timeline for it?

jenswir avatar Oct 14 '20 16:10 jenswir

We have a plan to fix it. We will provide a slicing config, in addition to the slicing_fn in stats_options, that allows slicing by features. However, we don't have a timeline at this time.

brills avatar Oct 16 '20 16:10 brills

@ConverJens,

slice_functions is deprecated; you can use experimental_slice_functions to generate slice keys instead. Please refer to the tfdv.StatsOptions documentation. Thank you!
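To make "generating slice keys" concrete, here is a conceptual sketch in plain Python. It is not TFDV's implementation: it only illustrates slicing on every unique value of one feature, which is what passing features={'region': None} requests. The function name slice_rows_by_feature, the sample rows, and the "<feature>_<value>" key format are assumptions made for illustration.

```python
from collections import defaultdict


def slice_rows_by_feature(rows, feature_name):
    """Group rows into slices keyed by '<feature>_<value>' (assumed key format)."""
    slices = defaultdict(list)
    for row in rows:
        value = row.get(feature_name)
        if value is None:
            # Rows that lack the feature belong to no slice in this sketch.
            continue
        slices[f"{feature_name}_{value}"].append(row)
    return dict(slices)


# Hypothetical sample data.
rows = [
    {"region": "EU", "duration": 3},
    {"region": "US", "duration": 5},
    {"region": "EU", "duration": 7},
    {"duration": 1},
]

for key, members in slice_rows_by_feature(rows, "region").items():
    print(key, len(members))
```

Statistics would then be computed once per slice, in addition to the statistics over all examples.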

singhniraj08 avatar Oct 07 '22 10:10 singhniraj08

Closing this due to inactivity. Please take a look at the answers provided above, and feel free to reopen and post your comments if you still have questions. Thank you!

singhniraj08 avatar Jan 20 '23 13:01 singhniraj08


Hi @singhniraj08 , I still have the same (or very similar) issue:

import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import slicing_util

# Slice on region feature (i.e., every unique value of the feature).
slicer_region = slicing_util.get_feature_value_slicer(features={'region': None})
stats_options = tfdv.StatsOptions(experimental_slice_functions=[slicer_region])
tfdv_statistics = tfdv.generate_statistics_from_dataframe(data, stats_options)
>>> len(tfdv_statistics.datasets)
1
>>> tfdv.get_slice_stats(tfdv_statistics, 'region')
ValueError: Invalid slice key.

The visualization also doesn't show any information on slices.

In addition, I can see that the StatsOptions.experimental_slice_functions is actually never called for calculating statistics (only during __init__ to validate input arguments)

What am I doing wrong?

Ruwann avatar Jun 13 '23 10:06 Ruwann

@Ruwann,

I was able to create slices on a given feature and get the slice stats for a particular slice key. The tfdv.get_slice_stats function expects the dataset name as the slice key, and returns the stats of that particular dataset slice. You can refer to the example gist. Thanks.
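Since the slice key has to match a dataset name exactly, a bare feature name such as 'region' will not match anything. A minimal sketch for finding usable keys, assuming slice datasets are named "<feature>_<value>" (the helper name slice_keys_for_feature and the stand-in data are hypothetical):

```python
from types import SimpleNamespace


def slice_keys_for_feature(stats, feature_name):
    """Return dataset names that look like slices of the given feature."""
    prefix = feature_name + "_"
    return [d.name for d in stats.datasets if d.name.startswith(prefix)]


# Stand-in for a sliced DatasetFeatureStatisticsList; names are assumed.
fake_stats = SimpleNamespace(
    datasets=[
        SimpleNamespace(name="All Examples"),
        SimpleNamespace(name="region_EU"),
        SimpleNamespace(name="region_US"),
    ]
)

print(slice_keys_for_feature(fake_stats, "region"))
```

Any of the returned names can then be passed as the slice key, for example tfdv.get_slice_stats(stats, 'region_EU'). Note that the prefix match is deliberately naive and would also pick up slices of another feature whose name starts with "region_".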

singhniraj08 avatar Jun 20 '23 07:06 singhniraj08

Thanks for the example gist, really helpful for me! I was able to figure it out: it seems that generate_statistics_from_dataframe does not calculate slice statistics.

For future reference, the following workaround works for me:

from tempfile import NamedTemporaryFile
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import slicing_util

# Slice on region feature (i.e., every unique value of the feature).
slicer_region = slicing_util.get_feature_value_slicer(features={'region': None})
stats_options = tfdv.StatsOptions(experimental_slice_functions=[slicer_region])
with NamedTemporaryFile(suffix=".csv") as f:
    # index=False keeps the DataFrame index from being written as an extra column.
    data.to_csv(f.name, index=False)
    tfdv_statistics = tfdv.generate_statistics_from_csv(
        data_location=f.name,
        stats_options=stats_options
    )

If you want, I can open up a separate issue for this?

Ruwann avatar Jun 20 '23 12:06 Ruwann