Add correlations to Facets charts/tables

Open ianhellstrom opened this issue 5 years ago • 6 comments

TensorFlow Data Validation is a great tool for inspecting data. One feature that might make it even better would be computing correlations among the variables, so that if two variables are highly correlated you can avoid multicollinearity by dropping one of them. Having that available in the Facets visualization would make it easier to spot issues with the data.

ianhellstrom avatar May 17 '19 06:05 ianhellstrom

@jameswex

paulgc avatar May 17 '19 17:05 paulgc

Two pieces here:

  1. Calculating correlations between features. Does anything in TFDV do this currently? If not, would need to define a proto format for capturing this data, and build a pipeline to calculate it.
  2. Once that is done, would need a visualization to best show this information. It's possible it could be part of Facets Overview, but also possible that it might work best as a new visualization, as Facets Overview hasn't been designed with cross-feature statistics (such as correlation) in mind.
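
For illustration, the first piece (calculating correlations between features) can be sketched in plain Python. This is only a minimal sketch under stated assumptions, not how TFDV computes or stores anything: the helper names `pearson` and `highly_correlated_pairs`, the toy `features` data, and the 0.9 threshold are all made up for the example, and it assumes numeric columns of equal length with no missing values.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold.

    `columns` maps a feature name to its list of values. The threshold is an
    arbitrary illustrative cutoff, not a recommended value.
    """
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if not math.isnan(r) and abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical toy data: two columns carry the same information in different units.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.06, 62.99, 66.93, 70.87],
    "shoe_size": [40.0, 38.0, 43.0, 39.0],
}

flagged = highly_correlated_pairs(features)
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r:.3f})")
```

In this toy data only the `height_cm` / `height_in` pair is flagged, which is the kind of redundant pair the original request wants to surface. A real pipeline would also have to handle missing values, categorical features, and the quadratic number of feature pairs.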

jameswex avatar May 17 '19 17:05 jameswex

@jameswex We are currently planning to compute correlation statistics in TFDV and will probably update the TF.Metadata statistics proto to capture these statistics.

paulgc avatar May 17 '19 17:05 paulgc

Any updates on this / where it is on the roadmap?

I agree, this tool is excellent and the correlations are the only thing missing at the moment.

As such, I was happy to see that it was already raised.

Cheers.

robinvanschaik avatar Jan 23 '21 17:01 robinvanschaik

We already have a stats generator (tensorflow_data_validation/statistics/generators/cross_feature_stats_generator.py). You can try enabling it by specifying it in StatsOptions.generators.

But currently Facets does not visualize the results.

We could attach the cross stats as custom stats (like the LiftStatsGenerator does).

brills avatar Apr 07 '21 17:04 brills

Hello, is there any update on the possibility of having correlations in tfdv.visualize_statistics()? Great tool, but I think this is needed. Thanks.

AndresMontero avatar Feb 11 '22 14:02 AndresMontero