
Feature : Correlations

Open chanedwin opened this issue 3 years ago • 5 comments

Overview : Spark Development Strategy

Branch : spark-branch

Feature :

Three correlation measures (Cramer's V, Kendall's Tau and Phi-K) are implemented in pandas-profiling, but not in spark-profiling. We need to implement them in Spark in an optimised manner.

  • [x] Phi-K - correlation between continuous and discrete variables
  • [ ] Cramer's V - correlation between discrete variables
  • [ ] Kendall's Tau - correlation between continuous or discrete (ordinal) variables

Tips to Get Started :

  • Check in with Edwin as he has some code on this
  • Each correlation may take a long time to optimise, and none of them is trivial

chanedwin avatar Oct 04 '21 15:10 chanedwin

phik - done in https://github.com/pandas-profiling/pandas-profiling/commit/b3b41cc0d127ac3dac3480cd94a55f9556b671dc

chanedwin avatar Oct 23 '21 10:10 chanedwin

@chanedwin Hi, I would like to get started on this bug. Could you guide me with some code?

rishabsinghh avatar Feb 03 '22 12:02 rishabsinghh

hi @rishabsinghh! Sure! I'll update with more in this post in a bit. You can DM me on the pp slack too if you have any further questions!

chanedwin avatar Feb 04 '22 08:02 chanedwin

Sure, I'll be waiting. I didn't get the PP Slack part, though: how can I reach you through that?

rishabsinghh avatar Feb 04 '22 13:02 rishabsinghh

Code : take a look at this! https://github.com/chanedwin/pandas-profiling/blob/d9ee4a8a589e075cfced9fc71ca500a20e2a3e73/src/pandas_profiling/model/correlations.py#L140

This was my original implementation, using vectorized pandas UDFs for Kendall and Cramer's V. I think we should do this in native Spark if possible, since we should see significant speed improvements (although that might not be trivial). We can continue the discussion on Slack!

You can join the slack here!
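
One possible reading of "native Spark" for Kendall on discrete (ordinal) columns, sketched under assumptions: let Spark do only the grouped count of `(x, y)` pairs, then compute Kendall's Tau-b from that small table. The function name `kendall_tau_b` and the `{(x, y): n}` input are hypothetical and not taken from the linked implementation. The pair loop is quadratic in the number of distinct cells, so this only suits low-cardinality columns, not continuous ones.

```python
from math import sqrt


def kendall_tau_b(counts):
    """Kendall's Tau-b from a {(x, y): n} mapping of joint counts.

    Hypothetical helper: assumes x and y are numeric or ordinal codes and
    that the counts were aggregated beforehand (for example by a grouped
    count in Spark).
    """
    cells = [(x, y, n) for (x, y), n in counts.items() if n > 0]

    # Every pair of distinct cells contributes n1 * n2 row pairs.
    concordant = discordant = 0
    for i, (x1, y1, n1) in enumerate(cells):
        for x2, y2, n2 in cells[i + 1:]:
            sign = (x1 - x2) * (y1 - y2)
            if sign > 0:
                concordant += n1 * n2
            elif sign < 0:
                discordant += n1 * n2

    # Marginal totals give the number of pairs tied on x and on y.
    x_totals, y_totals = {}, {}
    for x, y, n in cells:
        x_totals[x] = x_totals.get(x, 0) + n
        y_totals[y] = y_totals.get(y, 0) + n

    total = sum(n for _, _, n in cells)
    pairs = total * (total - 1) // 2
    x_ties = sum(t * (t - 1) // 2 for t in x_totals.values())
    y_ties = sum(t * (t - 1) // 2 for t in y_totals.values())

    denom = sqrt((pairs - x_ties) * (pairs - y_ties))
    if denom == 0:
        return float("nan")  # undefined when either column is constant
    return (concordant - discordant) / denom
```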

chanedwin avatar Feb 04 '22 16:02 chanedwin