
Add a new way to compute column correlations

Open rcap107 opened this issue 4 months ago • 7 comments

As suggested on Mattermost by @dholzmueller (noting it here so I don't lose track of it):

A very fast (linear-algebra-based, so vectorizable) method to get (nonlinear) dependency scores between features: https://www.pnas.org/doi/10.1073/pnas.2509860122 With some adjustment for handling categoricals, it could be a candidate to replace Cramér's V in skrub. (Apparently ordinal-encoding the categoricals already works well, but I suspect that does not hold for high-cardinality features.)

rcap107 avatar Aug 27 '25 09:08 rcap107

Hi, I would like to work on this issue. Do you have any specific guidelines or recommendations?

JadeAffolabi avatar Oct 26 '25 20:10 JadeAffolabi

Hi @JadeAffolabi, thanks for checking out the issue. Be aware that this looks like a fairly complicated issue to start with.

Aside from the usual considerations reported in https://skrub-data.org/stable/CONTRIBUTING.html, you can start by looking at skrub/_column_associations.py to see how we implemented Cramér's V and the Pearson correlation.
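
For orientation, Cramér's V is derived from the chi-squared statistic of the contingency table of two categorical columns. A minimal NumPy/pandas sketch of the plain (uncorrected) version, not skrub's actual implementation, which may differ in details such as bias correction:

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V between two categorical columns.

    Plain chi-squared version, without the bias correction that some
    implementations apply. Illustration only, not skrub's code.
    """
    table = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy()
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))

# Perfectly associated columns give V == 1.0
print(cramers_v(["a", "a", "b", "b"], ["u", "u", "v", "v"]))  # → 1.0
```

Whatever replaces this would plug into the same "score per column pair" shape, so the table-report side stays mostly unchanged.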

As with everything in skrub, the code for this feature should support both pandas and polars, so you may need to add a function to skrub/_dataframe/_common.py.
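
The dataframe-agnostic part can be sketched with a plain isinstance dispatch. This is only an illustration of the idea, with a hypothetical helper name; skrub's _common.py has its own dispatch machinery, and polars is an optional dependency there:

```python
import sys
import pandas as pd

def to_numpy_column(col):
    """Return a column's values as a NumPy array, for pandas or polars.

    Hypothetical helper illustrating the kind of function one might add to
    skrub/_dataframe/_common.py; skrub itself uses a dispatch mechanism.
    """
    if isinstance(col, pd.Series):
        return col.to_numpy()
    # Only look at polars if it is already importable/imported, so that the
    # helper works in environments without polars installed.
    pl = sys.modules.get("polars")
    if pl is not None and isinstance(col, pl.Series):
        return col.to_numpy()
    raise TypeError(f"unsupported column type: {type(col).__name__}")

print(to_numpy_column(pd.Series([1, 2, 3])))  # → [1 2 3]
```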

There is also the question of adding this to the table report once the new association scores are computed, but that can be tackled in a separate PR.

rcap107 avatar Oct 27 '25 15:10 rcap107

Hi, I think I can do it. Thanks for the indications.

JadeAffolabi avatar Oct 30 '25 07:10 JadeAffolabi

> Hi, I think I can do it. Thanks for the indications.

@JadeAffolabi I realized too late that I pasted the wrong link; I meant to refer to the contributing guide.

rcap107 avatar Oct 30 '25 08:10 rcap107

Besides the raw implementation, there are two main questions that need to be solved in my view:

  • How to handle categoricals? For a first draft one could ordinal-encode them and treat them as numerical, but for higher-cardinality categoricals I think it would be better to create embeddings directly instead of using the numerical kernel. Using OneHotEncoder with some value for max_categories could work well.
  • How to avoid reporting overfitted scores? IIRC the paper mentions computing p-values, possibly via permutation tests. I don't know how big of a problem this is, but at the very least we want to avoid giving higher scores to categorical variables just because they are easier to overfit.

cc'ing one of the authors, @aradha

dholzmueller avatar Oct 30 '25 09:10 dholzmueller

Yes, one-hot encoding is definitely a better approach than ordinal encoding. The original paper does mention computing p-values from permutation tests in order to decide whether two variables should be declared dependent.
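
Generically, a permutation test recomputes the score on shuffled copies of one column and reports how often the shuffled score matches or exceeds the observed one. A hypothetical sketch, with plain |Pearson r| standing in for the paper's dependency score:

```python
import numpy as np

def permutation_p_value(x, y, score, n_perm=500, seed=0):
    """p-value for the null 'x and y are independent', via permutations.

    `score` is any dependence measure; we shuffle y and count how often the
    permuted score is at least as large as the observed one. The +1 terms
    keep the estimated p-value strictly positive.
    """
    rng = np.random.default_rng(seed)
    observed = score(x, y)
    perm_scores = [score(x, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in perm_scores)) / (1 + n_perm)

# Stand-in score; the real method would use the paper's dependency measure.
abs_pearson = lambda x, y: abs(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(42)
x = rng.normal(size=200)
print(permutation_p_value(x, x + 0.1 * rng.normal(size=200), abs_pearson))
# → very small (strong dependence: no permutation beats the observed score)
print(permutation_p_value(x, rng.normal(size=200), abs_pearson))
# independent draw: typically not small
```

Reporting only scores whose p-value clears a threshold would address the overfitting concern above, at the cost of n_perm score evaluations per column pair.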

JadeAffolabi avatar Oct 30 '25 12:10 JadeAffolabi

Hello, I've submitted a pull request.

JadeAffolabi avatar Nov 03 '25 09:11 JadeAffolabi