skrub icon indicating copy to clipboard operation
skrub copied to clipboard

FEAT - Add interdependence score to column associations

Open JadeAffolabi opened this issue 1 month ago • 9 comments

Description This pull request is linked to issue #1576 .

I implemented the numpy version of the Interdependence Score (IDS) from the paper Efficiently quantifying dependence in massive scientific datasets using InterDependence Scores.

New file The new functions are in the file skrub/_interdependence_score.py. The main function of the file is _ids_matrix(). It can be called directly.

TO DO If the PR is accepted, I can modified the file skrub/_column_associations.py so that the IDS could be displayed in the output of column_associations, along with the pvalues associated with the IDS.

JadeAffolabi avatar Nov 02 '25 22:11 JadeAffolabi

Hi @JadeAffolabi, a few points about this PR.

As I was expecting, this will be a very complex PR that will need various iterations of discussions and implementation, and will take a long time to review. If your expectation is for this PR to be merged in the short term (e.g., for some kind of assignment...), that will most likely not be the case.

Then, there is no need to open and close the PR multiple times. If you need to re-run the CI, you need to push a new commit to trigger the execution of the CI again.

The PR is missing all tests (which is also why coverage is not passing), so they should be added. You can refer to https://github.com/skrub-data/skrub/pull/1310 for a PR that adds a similar feature, and how it is tested.

Similarly, the PR is missing proper examples in the docstrings, and a relative example in the documentation. Again, check #1310 for more info on how this should be done.

So this PR needs a lot more work before it can be merged, and if you happen to need to meet a deadline with it you might want to reconsider working on this. Very often, contributions to open source software can take many months to be merged, and this looks like one that won't be merged anytime soon.

rcap107 avatar Nov 03 '25 11:11 rcap107

Hi @rcap107 ,

Thank you for your feedback. I will keep working on the PR. This work was indeed part of an assignment, but it doesn't matter, I will do my best to complete the PR.

Have a nice day.

JadeAffolabi avatar Nov 04 '25 08:11 JadeAffolabi

Hi @rcap107 ,

Thank you for your feedback. I will keep working on the PR. This work was indeed part of an assignment, but it doesn't matter, I will do my best to complete the PR.

Have a nice day.

That sounds good! What matters to me is that working on this may take a long time, but if that is ok with you then you are of course more than welcome to continue. Thanks for the effort, and feel free to ask if you need further help.

rcap107 avatar Nov 05 '25 12:11 rcap107

Hi @rcap107 , You said that the "PR is missing proper examples in the docstrings, and a relative example in the documentation." It seems you a talking about two types of examples. I assume that one is the type of example that shows a snippet of code to demonstrate how to use the new function and that example will be in the main function docstrings. But for the other type of example I don't get it.

I was also wondering if I should add the interdependence score (IdS) to the output of the column_associations function or if I should make the function that implements the IdS public.

JadeAffolabi avatar Nov 15 '25 16:11 JadeAffolabi

Hello, I realized that using norm 1 or norm 2 to calculate the IdS does not respect the following property : Ids(x,x) = 1, where a variable has perfect dependence with itself. That's why I propose implementing only the norm infinity (I had implemented all three).

What do you think?

JadeAffolabi avatar Nov 18 '25 07:11 JadeAffolabi

Hi @rcap107 , You said that the "PR is missing proper examples in the docstrings, and a relative example in the documentation." It seems you a talking about two types of examples. I assume that one is the type of example that shows a snippet of code to demonstrate how to use the new function and that example will be in the main function docstrings. But for the other type of example I don't get it.

In the docstring there should be a snippet of code that shows how you're supposed to call the function and what the result looks like on an example dataset. Then, there should be a script in the "examples" folder that explains gives some additional context on the operation. #1310 is a very good example to take inspiration from because it's a very similar feature and has both the docstring examples and the full script to check out.

You can also use this document to find more info on how to write an example.

It should be fairly short, and it's important to have one the gallery.

I was also wondering if I should add the interdependence score (IdS) to the output of the column_associations function or if I should make the function that implements the IdS public.

The function should be public, and the output should also be added to the Associations tab in the TableReport (although this part can be done in a separate PR).

From a quick look at the code, it seems like the output of the ids function is a square matrix. If that is the case, it should be modified so that it is a dataframe that contains the list of all the associations between the columns. Check out the _column_associations.py file where the _melt operation is done for what I mean.

rcap107 avatar Nov 18 '25 08:11 rcap107

Small note, this pr should get an entry in the changelog.

rcap107 avatar Nov 18 '25 08:11 rcap107

Hi @JadeAffolabi, I really appreciate the effort you're putting in this PR, especially considering it is quite a complex issue.

Unfortunately, as a result of it being a complex issue it is also something that takes quite a bit of time to review, to make sure it is working as intended and properly implements the original paper.

For your information, I won't be able to allocate the time necessary for a while more (likely, not until January), though if I can I'll try to leave some feedback before then.

rcap107 avatar Nov 24 '25 12:11 rcap107

Hello, No worries, take your time. If there are any other changes to be made, let me know.

JadeAffolabi avatar Nov 25 '25 10:11 JadeAffolabi