ydata-profiling Feature : Correlations

Branch : spark-branch

Feature :

Three types of correlation (Cramer's V, Kendall's Tau and Phi-K) are implemented in pandas-profiling, but not in spark-profiling. We need to implement them in Spark in an optimised manner.

[x] Phi-K - correlations between continuous and discrete variables
[ ] Cramer's V - correlation between discrete variables
[ ] Kendall's Tau - correlation between continuous or discrete (ordinal) variables
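
For checking a Spark implementation against small samples, a plain-Python reference can help. The sketch below is a brute-force Kendall's tau-b (pairwise comparison, O(n²), with the usual tie correction). The name `kendall_tau_b` is made up for illustration, and this is not the code used in pandas-profiling.

```python
import math
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Brute-force Kendall's tau-b for two equal-length numeric sequences.

    Hypothetical reference helper: only meant for validating a faster
    implementation on small inputs.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")

    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither term
        if dx == 0:
            ties_x += 1  # tied in x only
        elif dy == 0:
            ties_y += 1  # tied in y only
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    # A constant column leaves tau undefined.
    return (concordant - discordant) / denom if denom else float("nan")
```

Comparing every pair of rows like this will not scale to Spark-sized data, which is part of why the optimised version is not trivial.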

Tips to Get Started :

Check in with Edwin, as he has some code on this
Each correlation is non-trivial and might take a long time to optimise

Oct 04 '21 15:10 chanedwin

phik - done in https://github.com/pandas-profiling/pandas-profiling/commit/b3b41cc0d127ac3dac3480cd94a55f9556b671dc

Oct 23 '21 10:10 chanedwin

@chanedwin Hi, I would like to get started on this bug. Could you guide me with some code?

Feb 03 '22 12:02 rishabsinghh

Hi @rishabsinghh! Sure! I'll update this post with more details in a bit. You can also DM me on the PP Slack if you have any further questions!

Feb 04 '22 08:02 chanedwin

Sure, I'll be waiting. I didn't follow the part about the PP Slack, though. How can I reach you through that?

Feb 04 '22 13:02 rishabsinghh

Code : take a look at this! https://github.com/chanedwin/pandas-profiling/blob/d9ee4a8a589e075cfced9fc71ca500a20e2a3e73/src/pandas_profiling/model/correlations.py#L140

This was my original implementation, which uses vectorized pandas UDFs for Kendall and Cramer's V. I think we should do this in native Spark if possible, since that should give significant speed improvements (although it might not be trivial). We can continue the discussion on Slack!
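
As a rough sketch of the native route for Cramer's V: the only part that has to run in Spark is the pair count, e.g. `df.groupBy(col_a, col_b).count()`, and the statistic itself can then be computed on the driver from that small table. The snippet assumes those counts have already been collected into a dict. `cramers_v_from_counts` is a hypothetical helper, and it uses the plain (uncorrected) formula, so no bias correction is applied.

```python
import math
from collections import defaultdict


def cramers_v_from_counts(pair_counts):
    """Uncorrected Cramer's V from {(a_value, b_value): count}.

    Assumption: pair_counts holds the collected result of something like
    df.groupBy(col_a, col_b).count(); this helper is illustrative only.
    """
    row_totals = defaultdict(int)
    col_totals = defaultdict(int)
    n = 0
    for (a, b), count in pair_counts.items():
        if count <= 0:
            continue  # skip empty cells so no category has a zero total
        row_totals[a] += count
        col_totals[b] += count
        n += count

    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k <= 0:
        return float("nan")  # undefined with fewer than 2 categories

    # Pearson chi-squared over the full contingency table, including
    # (a, b) combinations that never occur together (observed = 0).
    chi2 = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = pair_counts.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * k))
```

This keeps the heavy lifting in a single aggregation per column pair. Whether that beats the pandas UDF version in practice would still need to be measured.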

You can join the slack here!

Feb 04 '22 16:02 chanedwin
