
[dqm]: Features Baseline Statistics Profiler

TremaMiguel opened this issue 3 years ago

Is your feature request related to a problem? Please describe.

Data Quality Monitoring includes the option to create an ExpectationSuite based on GE. For numerical or categorical features it is convenient to calculate baseline statistics that depend on the type of feature.

For example, for numerical features we can calculate basic statistics like mean, median, std, etc.; for categorical features we can calculate frequencies, etc.

{
    'baseline_statistics': [
        {
            'feature_name': 'dummy_feature_num_type',
            'feature_type': 'numerical',
            'statistics': {
                'name': 'median',
                'value': 50
            }
        },
        {
            'feature_name': 'dummy_feature_cat_type',
            'feature_type': 'categorical',
            'statistics': {
                'name': 'most_frequent',
                'value': 'cat_1'
            }
        },
    ]
}
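
For illustration, a payload like the one above could be computed from a training dataframe with pandas. This is only a sketch; the build_baseline_statistics helper and the feature lists are hypothetical.

import pandas as pd

def build_baseline_statistics(training_df: pd.DataFrame,
                              numerical: list,
                              categorical: list) -> dict:
    """Compute a simple baseline-statistics payload from a training DataFrame."""
    baseline = []
    for feature in numerical:
        baseline.append({
            'feature_name': feature,
            'feature_type': 'numerical',
            'statistics': {'name': 'median', 'value': float(training_df[feature].median())},
        })
    for feature in categorical:
        baseline.append({
            'feature_name': feature,
            'feature_type': 'categorical',
            'statistics': {'name': 'most_frequent', 'value': training_df[feature].mode().iloc[0]},
        })
    return {'baseline_statistics': baseline}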

Once the baseline statistics are calculated and registered somewhere, a GE Profiler can be created

from feast.dqm.profilers.ge_profiler import ge_profiler
from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.dataset import PandasDataset


@ge_profiler
def stats_profiler(ds: PandasDataset) -> ExpectationSuite:
    ds.expect_column_mean_to_be_between(
       ...
    )

    ds.expect_column_median_to_be_between(
       ...
    )
    return ds.get_expectation_suite()

Finally, a new set of features could be validated using the baseline profiler.

feats = get_historical_features()

feats.to_df(validation_reference=validation_reference)
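
For context, the validation_reference used above would typically come from a saved dataset. Below is a minimal sketch based on the Feast DQM tutorial; the dataset name, storage path, and the entity_df/features variables are placeholders, and the exact as_reference signature may differ between Feast versions.

from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

# Persist the retrieval job as a saved dataset, then turn it into a validation reference.
job = store.get_historical_features(entity_df=entity_df, features=features)
saved = store.create_saved_dataset(
    from_=job,
    name="baseline_training_ds",  # placeholder name
    storage=SavedDatasetFileStorage(path="baseline_training_ds.parquet"),
)
validation_reference = saved.as_reference(profiler=stats_profiler)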

Describe the solution you'd like

Create NumericalProfiler and CategoricalProfiler classes that calculate baseline statistics and generate a GE Profiler.

from abc import ABC, abstractmethod


class BaseFeatureProfiler(ABC):
    @abstractmethod
    def calculate_baseline_statistics(self):
        ...

    @abstractmethod
    def create_ge_profiler(self):
        ...


class NumericalProfiler(BaseFeatureProfiler):
    # Default statistics to calculate
    _statistics = [
        'mean',
        'median',
        'std',
        'min',
        'max',
        'quantiles',
    ]

    def __init__(
        self,
        variables: list,          # pass a list of numerical features
        statistics: list = None,  # pass a list of baseline statistics; defaults to _statistics
    ):
        self.variables = variables
        self.statistics = statistics or self._statistics

    def calculate_baseline_statistics(self):
        """Calculate numerical features baseline statistics."""

    def create_ge_profiler(self):
        """Define GE Profiler to check numerical feature statistics."""

Then, a validation reference could be created with NumericalProfiler, CategoricalProfiler, or both.

numerical_profiler = NumericalProfiler(**kwargs).create_ge_profiler()
validation_reference = ds.as_reference(profiler=numerical_profiler)

TremaMiguel avatar Apr 26 '22 02:04 TremaMiguel

Hi @TremaMiguel, thanks for raising this issue.

If I understand correctly, your intention here is to have an automatic profiler, which is a great idea by itself. However, in real-world circumstances it might not be as simple as using a stats-based generated profile. In fact, there's already an implementation that I extensively experimented with: BasicDatasetProfiler. Among other things, this implementation does basic stats calculation and generates expectations based on those numbers. The problem is that it doesn't take any distribution shift into account, and there will be shift. If we're talking about comparing two non-equal datasets, their stats will never be exactly the same, especially if we look at the case where we compare a small window of logged features (say, a 15-minute window) against training dataset statistics. That difference will be even bigger than when comparing two batch datasets.

So at the very least the user needs to specify thresholds or confidence intervals for each feature themselves, which this implementation doesn't allow. For this reason our documentation recommends writing expectations yourself (at least for now), since you can specify thresholds and the percent of expected outliers. Of course, we plan to work on automatic profilers in the foreseeable future, but it feels like a very long story.
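
For illustration, a hand-written expectation with explicit thresholds and a tolerated percent of outliers could look like the sketch below; the column name and bounds are placeholders, and GE's mostly argument controls the fraction of values allowed to fall outside the range.

@ge_profiler
def manual_profiler(ds: PandasDataset) -> ExpectationSuite:
    # Bounds chosen manually from the training data, with up to 1% outliers tolerated.
    ds.expect_column_values_to_be_between(
        column="dummy_feature_num_type",  # placeholder column name
        min_value=0,
        max_value=100,
        mostly=0.99,
    )
    return ds.get_expectation_suite()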

For now I can also recommend looking at the Rule-Based Profilers that were recently introduced in GE.

pyalex avatar Apr 27 '22 18:04 pyalex

Hi @pyalex, thanks for answering this one.

I was thinking about a scenario where you have accumulated a considerable amount of production data to compare against the baseline dataset, and where you can make some calculations to get an idea of the plausible distribution shift.

But I agree with you that BasicDatasetProfiler is just too basic to base a decision on if you see some changes. I think easier and more informative metrics to use are PSI (Population Stability Index) and CSI (Characteristic Stability Index).
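
For reference, PSI compares the binned frequency distribution of the baseline against that of the production data; below is a minimal numpy sketch (the bin count and the epsilon guard are arbitrary choices).

import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline (expected) and a production (actual) sample of one numerical feature."""
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))  # quantile bin edges
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # avoid log(0) and division by zero
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))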

Additionally, I'm not sure whether it's in the scope of Feast to integrate a monitoring module able to detect data, feature-attribution, and model drift, because this also involves scheduling the monitoring job and notifying results or triggering events to take some action.

TremaMiguel avatar Apr 30 '22 01:04 TremaMiguel

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 20 '22 19:09 stale[bot]