ydata-profiling
Feature : Correlations
Overview : Spark Development Strategy
Branch : spark-branch
Feature :
Three types of correlation (Cramer's V, Kendall's Tau and Phi-K) are implemented in pandas-profiling, but not in spark-profiling. We need to implement them in Spark in an optimised manner.
- [x] Phi-K - correlations between continuous and discrete variables
- [ ] Cramer's V - correlation between discrete variables
- [ ] Kendall's Tau - correlation between continuous or discrete (ordinal) variables
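For orientation, Cramer's V depends only on the joint counts of the two discrete columns, so one possible split is to let Spark do the aggregation (for example with `df.groupBy(a, b).count()`) and compute the statistic on the driver. The sketch below covers only the driver-side step in plain Python. The function name `cramers_v_from_counts` and the input shape (a dict of pair counts) are assumptions made for illustration, not existing code in the project, and it uses the uncorrected formula, so results may differ from an implementation that applies a bias correction.

```python
import math
from collections import defaultdict


def cramers_v_from_counts(pair_counts):
    """Uncorrected Cramer's V from joint counts.

    pair_counts: dict mapping (a_value, b_value) -> count, e.g. built from
    the collected rows of a Spark groupBy(a, b).count() (assumed upstream step).
    """
    row_totals = defaultdict(int)
    col_totals = defaultdict(int)
    n = 0
    for (a, b), count in pair_counts.items():
        if count <= 0:
            continue  # ignore empty cells so no marginal total is zero
        row_totals[a] += count
        col_totals[b] += count
        n += count

    # Degrees-of-freedom term: min(rows - 1, cols - 1)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k <= 0:
        return float("nan")  # undefined when either column is constant

    # Pearson chi-squared over every cell of the contingency table,
    # including cells that never occur in the data (observed = 0).
    chi2 = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = pair_counts.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * k))
```

Only the distinct value pairs reach the driver, so this stays cheap as long as both columns have modest cardinality. High-cardinality columns would need a different approach.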
Tips to Get Started :
- Check in with Edwin as he has some code on this
- Each correlation might take a long time to optimise, and is not so trivial
phik - done in https://github.com/pandas-profiling/pandas-profiling/commit/b3b41cc0d127ac3dac3480cd94a55f9556b671dc
@chanedwin Hi, I would like to get started on this bug. Could you guide me with some code?
hi @rishabsinghh! Sure! I'll update with more in this post in a bit. You can DM me on the pp slack too if you have any further questions!
Sure, I'll be waiting. I'm not on the PP Slack, though. How can I reach you through that?
Code : take a look at this! https://github.com/chanedwin/pandas-profiling/blob/d9ee4a8a589e075cfced9fc71ca500a20e2a3e73/src/pandas_profiling/model/correlations.py#L140
This was my original implementation using vectorized pandas UDFs for Kendall and Cramer's V, but I think we should do this in native Spark if possible, because we should see significant speed improvements (although that might not be so trivial). We can continue discussions on Slack!
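When porting Kendall's Tau, a small reference implementation can be handy for checking a Spark version against known answers. The sketch below is a naive O(n²) tau-b in plain Python. It is not the linked implementation and not a Spark solution, and the name `kendall_tau_b` is made up for illustration. It is only meant for unit tests on small inputs.

```python
import math


def kendall_tau_b(xs, ys):
    """Naive O(n^2) Kendall tau-b, intended only as a small-input test oracle."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")

    concordant = discordant = 0
    tied_x_only = tied_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1

    # Pairs not tied in x, and pairs not tied in y
    not_tied_x = concordant + discordant + tied_y_only
    not_tied_y = concordant + discordant + tied_x_only
    denom = math.sqrt(not_tied_x * not_tied_y)
    if denom == 0:
        return float("nan")  # undefined when either variable is constant
    return (concordant - discordant) / denom
```

Comparing a distributed implementation against this on a few hand-built cases (perfect agreement, perfect reversal, ties in one column) should catch most sign and tie-handling mistakes before any optimisation work starts.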
You can join the slack here!