Limit Cramer's V analysis to target column

Open dholzmueller opened this issue 6 months ago • 15 comments

Problem Description

I would like to get a quick idea of which features are important for predicting the target. For this, it would be nice if the TableReport had an option to display the Cramer's V of all features with respect to a given target feature, instead of (or in addition to) displaying Cramer's V between all features. (It could also be nice to make the table scrollable so we can see not just the top associations but all of them.)

Feature Description

I see two options: Either the target variable is specified in the constructor of TableReport (probably easier to implement) or there is a drop-down menu in the Associations tab in the HTML overview from which the target variable can be selected.

Alternative Solutions

No response

Additional Context

No response

dholzmueller avatar Jun 20 '25 13:06 dholzmueller

I wonder if such functionality should be in the TableReport, or rather in the column_association function, but not exposed in the TableReport. It seems a bit specific and low level, while the TableReport is very high level, and not specific to supervised learning (it has no idea of a "target")

GaelVaroquaux avatar Jun 20 '25 13:06 GaelVaroquaux

> I wonder if such functionality should be in the TableReport, or rather in the column_association function, but not exposed in the TableReport. It seems a bit specific and low level, while the TableReport is very high level, and not specific to supervised learning (it has no idea of a "target")

The actual computation is done in the _column_associations.py file, so that part will have to be modified in either case. It's more of a discussion whether the TableReport should be updated or not

rcap107 avatar Jun 20 '25 13:06 rcap107

> I wonder if such functionality should be in the TableReport, or rather in the column_association function, but not exposed in the TableReport. It seems a bit specific and low level, while the TableReport is very high level, and not specific to supervised learning (it has no idea of a "target")

If you implement it via a drop-down in the TableReport HTML, it is simply a way to navigate the large amount of values and doesn't have to be tied to supervised learning. For the alternative suggestion with the extra parameter of TableReport, I agree that it would make the TableReport look more like a supervised learning tool.

dholzmueller avatar Jun 20 '25 13:06 dholzmueller

> If you implement it via a drop-down in the TableReport HTML, it is simply a way to navigate the large amount of values and doesn't have to be tied to supervised learning.

Indeed, but it will make the front-end code around the TableReport (the HTML and CSS) more complicated, and I fear that it is already stretching the limit of what is "easy" to develop with (a tenet of software design: complexity grows approximately with the square of the number of features, so one must be careful about adding features).

GaelVaroquaux avatar Jun 20 '25 14:06 GaelVaroquaux

Given that the TableReport has the order_by parameter to specify a column to order by, maybe we can piggyback off that and report only the associations that are relative to that column 🤔
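A minimal sketch of the filtering this would imply, assuming the pairwise associations are already available as (left column, right column, Cramer's V) records. The record layout, the helper name `associations_with`, and the values are hypothetical and are not taken from the skrub code base.

```python
# Hypothetical pairwise results: (left column, right column, Cramer's V).
# Column names and values are made up for illustration.
pairs = [
    ("department", "division", 0.92),
    ("gender", "salary", 0.09),
    ("salary", "assignment_category", 0.64),
    ("department", "salary", 0.18),
]


def associations_with(pairs, column):
    """Keep only the pairs that involve `column`, strongest first."""
    rows = []
    for left, right, v in pairs:
        if column == left:
            rows.append((right, v))
        elif column == right:
            rows.append((left, v))
    return sorted(rows, key=lambda row: row[1], reverse=True)


salary_rows = associations_with(pairs, "salary")
```

Here `salary_rows` lists the three columns paired with `salary`, ordered by decreasing association; the `department`/`division` pair is dropped because it does not involve the chosen column.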

rcap107 avatar Jun 23 '25 12:06 rcap107

Hey @dholzmueller, would using something like skrub.column_association(df, y="my_target") work for you? This would output a dataframe like:

| Cramer's V              |   current_annual_salary |
|------------------------ |------------------------ |
| assignment_category     |               0.635525  |
| date_first_hired        |               0.0844172 |
| department              |               0.183838  |
| department_name         |               0.183838  |
| division                |               0.172892  |
| employee_position_title |               0.250635  |
| gender                  |               0.0903027 |
| year_first_hired        |               0.180557  |
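As a rough illustration of what such a target-restricted table involves, here is a standard-library sketch that computes the plain (uncorrected) Cramer's V of every column against one target and sorts the result. It assumes all columns are categorical and does not reproduce how skrub bins numerical columns; `cramers_v`, `associations_to_target`, and the toy data are hypothetical names and values, not part of skrub.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramer's V between two equal-length categorical sequences."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(x_counts), len(y_counts)) - 1
    if n == 0 or k <= 0:
        return 0.0  # empty or constant column: no measurable association
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


def associations_to_target(table, target):
    """Return (column, V) pairs for every non-target column, strongest first."""
    scores = [
        (name, cramers_v(values, table[target]))
        for name, values in table.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data, made up for illustration.
table = {
    "department": ["A", "A", "B", "B"],
    "gender": ["F", "M", "F", "M"],
    "salary_band": ["low", "low", "high", "high"],
}
ranking = associations_to_target(table, "salary_band")
```

In this toy table `department` determines `salary_band` exactly, so it ranks first, while `gender` is independent of the target and ranks last.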

Vincent-Maladiere avatar Jun 23 '25 14:06 Vincent-Maladiere

In my case, there were ~150 features, so it would at least be good if the table was sorted. Although if I'm in the position of a casual user who doesn't know about this function, it doesn't help me much.

dholzmueller avatar Jun 23 '25 14:06 dholzmueller

Ok, and would it help to mention in the association section of the TableReport that you could use the column_association function yourself, with more options?

Vincent-Maladiere avatar Jun 23 '25 16:06 Vincent-Maladiere

it might, assuming that I'm reading this section :)

dholzmueller avatar Jun 23 '25 16:06 dholzmueller

As skrub has a focus on supervised learning, having an optional 'target' parameter for the TableReport that causes it to show slightly different information might be a good idea, and it is something that @Vincent-Maladiere brought up when it was first added to skrub.

jeromedockes avatar Jun 23 '25 20:06 jeromedockes

Should it be the same object (TableReport) or a different one? And the question holds both from a user's API point of view, and from a code complexity point of view

GaelVaroquaux avatar Jun 24 '25 05:06 GaelVaroquaux

> Should it be the same object (TableReport) or a different one? And the question holds both from a user's API point of view, and from a code complexity point of view

If we choose to go in this direction (which I like a lot), my guess is that this should be the same object from a user perspective

Vincent-Maladiere avatar Jun 24 '25 08:06 Vincent-Maladiere

> If we choose to go in this direction (which I like a lot), my guess is that this should be the same object from a user perspective

OK, I see the point. I do worry that it is going to make the TableReport codebase more complicated, and it's currently complicated.

GaelVaroquaux avatar Jun 24 '25 13:06 GaelVaroquaux

In this case, what about having 2 private implementations, with a public one doing the dispatch? In the meantime, should we extend the column_association function as suggested above?
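A minimal sketch of that dispatch pattern, under the assumption that the two code paths differ only in which column pairs they cover. All function names here are hypothetical, and the sketch only builds the list of pairs rather than an actual report.

```python
def _report_all_pairs(columns):
    """Hypothetical private helper: list every unordered column pair."""
    return [
        (left, right)
        for i, left in enumerate(columns)
        for right in columns[i + 1:]
    ]


def _report_for_target(columns, target):
    """Hypothetical private helper: pair each column with the target only."""
    if target not in columns:
        raise ValueError(f"unknown target column: {target!r}")
    return [(name, target) for name in columns if name != target]


def association_pairs(columns, target=None):
    """Public entry point: dispatch on whether a target was given."""
    if target is None:
        return _report_all_pairs(columns)
    return _report_for_target(columns, target)


all_pairs = association_pairs(["a", "b", "c"])
target_pairs = association_pairs(["a", "b", "c"], target="c")
```

The public function stays a thin switch, so each private implementation can be read and tested on its own.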

Vincent-Maladiere avatar Jun 24 '25 14:06 Vincent-Maladiere

> In this case, what about having 2 private implementations, with a public one doing the dispatch?

That might be a good idea. I'm for anything that makes the code simpler to read and manage.

> In the meantime, should we extend the column_association function as suggested above?

Absolutely!

GaelVaroquaux avatar Jul 06 '25 17:07 GaelVaroquaux