data-validation
Add correlations to Facets charts/tables
TensorFlow Data Validation is a great tool for inspecting data. One feature that would make it even better is computing correlations among the variables, so that when two variables are highly correlated you can avoid multicollinearity by dropping one of them. Having that available in the Facets visualization would make it easier to spot issues with the data.
@jameswex
Two pieces here:
- Calculating correlations between features. Does anything in TFDV do this currently? If not, would need to define a proto format for capturing this data, and build a pipeline to calculate it.
- Once that is done, would need a visualization to best show this information. It's possible it could be part of Facets Overview, but also possible that it might work best as a new visualization, as Facets Overview hasn't been designed with cross-feature statistics (such as correlation) in mind.
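To make the first piece concrete, here is a minimal sketch of what "calculating correlations between features" could mean for numeric columns. It is plain Python and not TFDV code; the helper names (`pearson`, `correlated_pairs`), the 0.9 threshold, and the sample columns are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if not math.isnan(r) and abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical data: two height columns that differ only by unit, plus age.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.06, 62.99, 66.93, 70.87],
    "age": [30.0, 22.0, 41.0, 27.0],
}
flagged = correlated_pairs(columns)
```

With this data only the two height columns are flagged, which is the kind of pair one might drop to avoid multicollinearity. A real pipeline would also have to decide how to handle missing values and non-numeric features.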
@jameswex We are currently planning to compute correlation statistics in TFDV and probably update TF.Metadata statistics proto to capture these statistics.
Any updates on this / where it is on the roadmap?
I agree, this tool is excellent and the correlations are the only thing missing at the moment.
As such, I was happy to see that it was already raised.
Cheers.
We already have a stats generator (tensorflow_data_validation/statistics/generators/cross_feature_stats_generator.py). You can try enabling it by specifying it in StatsOptions.generators.
But currently Facets does not visualize the results.
We could attach the cross stats as custom stats (like the LiftStatsGenerator does).
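Enabling the generator mentioned above might look roughly like the sketch below. It rests on several assumptions that should be checked against the installed TFDV version: that the class in that module is named `CrossFeatureStatsGenerator` and accepts a `sample_rate` argument, that `tfdv.generate_statistics_from_dataframe` honors `StatsOptions.generators`, and that the results are written to the `cross_features` field of the statistics proto. The helper name `cross_feature_correlations` is hypothetical.

```python
def cross_feature_correlations(df, sample_rate=1.0):
    """Return {(feature_x, feature_y): correlation} for a pandas DataFrame.

    Assumes tensorflow_data_validation is installed and that the generator
    class and proto fields used below exist in the installed version.
    """
    # Imported inside the function so that defining the helper does not
    # itself require TFDV to be installed.
    import tensorflow_data_validation as tfdv
    from tensorflow_data_validation.statistics.generators import (
        cross_feature_stats_generator,
    )

    # Assumption: sample_rate=1.0 uses every example instead of a sample.
    generator = cross_feature_stats_generator.CrossFeatureStatsGenerator(
        sample_rate=sample_rate
    )
    options = tfdv.StatsOptions(generators=[generator])
    stats = tfdv.generate_statistics_from_dataframe(df, stats_options=options)

    # Assumption: each cross-feature entry carries path_x, path_y and
    # num_cross_stats.correlation.
    result = {}
    for cross in stats.datasets[0].cross_features:
        key = ("/".join(cross.path_x.step), "/".join(cross.path_y.step))
        result[key] = cross.num_cross_stats.correlation
    return result
```

Calling `cross_feature_correlations(df)` on a DataFrame with a few numeric columns should give the pairwise correlations as a plain dictionary, which can be inspected directly since Facets does not display them.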
Hello, is there any update on the possibility of having correlations in tfdv.visualize_statistics()? It's a great tool, but I think this is needed. Thanks.