Limit Cramer's V analysis to target column
Problem Description
I would like to get a quick idea of which features are important for predicting the target. For this, it would be nice if the TableReport had an option to display the Cramer's V of all features with respect to a given target feature, instead of (or in addition to) displaying Cramer's V between all features. (Additionally, it could be nice to make the table scrollable so we can see not just the top interactions but all of them.)
Feature Description
I see two options: either the target variable is specified in the constructor of TableReport (probably easier to implement), or there is a drop-down menu in the Associations tab of the HTML overview from which the target variable can be selected.
Alternative Solutions
No response
Additional Context
No response
I wonder if such functionality should be in the TableReport, or rather in the column_association function, but not exposed in the TableReport. It seems a bit specific and low level, while the TableReport is very high level, and not specific to supervised learning (it has no idea of a "target")
> I wonder if such functionality should be in the TableReport, or rather in the column_association function, but not exposed in the TableReport. It seems a bit specific and low level, while the TableReport is very high level, and not specific to supervised learning (it has no idea of a "target")
The actual computation is done in the _column_associations.py file, so that part will have to be modified in either case. It's more of a discussion whether the TableReport should be updated or not
> I wonder if such functionality should be in the TableReport, or rather in the column_association function, but not exposed in the TableReport. It seems a bit specific and low level, while the TableReport is very high level, and not specific to supervised learning (it has no idea of a "target")
If you implement it via a drop-down in the TableReport HTML, it is simply a way to navigate the large amount of values and doesn't have to be tied to supervised learning. For the alternative suggestion with the extra parameter of TableReport, I agree that it would make the TableReport look more like a supervised learning tool.
> If you implement it via a drop-down in the TableReport HTML, it is simply a way to navigate the large amount of values and doesn't have to be tied to supervised learning.
Indeed, but it will make the TableReport front-end code (the HTML and CSS) more complicated, and I fear that it is already stretching the limit of what is "easy" to develop with (a tenet of software design: complexity grows approximately with the square of the number of features, so one must be careful about adding features).
Given that the TableReport has the order_by parameter to specify a column to order by, maybe we can piggyback off that and report only the associations that are relative to that column 🤔
Hey @dholzmueller, would using something like skrub.column_association(df, y="my_target") work for you? This would output a dataframe like:
| Cramer's V | current_annual_salary |
|------------------------ |------------------------ |
| assignment_category | 0.635525 |
| date_first_hired | 0.0844172 |
| department | 0.183838 |
| department_name | 0.183838 |
| division | 0.172892 |
| employee_position_title | 0.250635 |
| gender | 0.0903027 |
| year_first_hired | 0.180557 |
In my case, there were ~150 features, so it would at least be good if the table were sorted. That said, if I'm a casual user who doesn't know about this function, it doesn't help me much.
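To make the request concrete, here is a minimal sketch of a target-restricted, sorted ranking. It is plain Python, not skrub's implementation: the helper names `cramers_v` and `associations_with_target` and the toy data are invented for illustration, it handles only categorical values (no binning of numeric columns), and it applies no bias correction.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of categorical values."""
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(x_counts), len(y_counts))
    if n == 0 or k < 2:
        return 0.0  # undefined for a constant column; this sketch reports 0
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def associations_with_target(columns, target):
    """Return (feature, V) pairs for every non-target column, strongest first."""
    y = columns[target]
    scores = [
        (name, cramers_v(values, y))
        for name, values in columns.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
table = {
    "department": ["A", "A", "B", "B", "C", "C"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "salary_band": ["low", "low", "high", "high", "high", "high"],
}
ranking = associations_with_target(table, target="salary_band")
for name, v in ranking:
    print(f"{name:12s} {v:.3f}")
```

With ~150 features, sorting by descending V puts the most strongly associated columns at the top, which is the part that matters for a quick look.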
Ok, and would it help to mention in the association section of the TableReport that you could use the column_association function yourself, with more options?
it might, assuming that I'm reading this section :)
As skrub has a focus on supervised learning, having an optional 'target' parameter for the TableReport that causes it to show slightly different information might be a good idea, and it is something that @Vincent-Maladiere brought up when it was first added to skrub.
Should it be the same object (TableReport) or a different one? And the question holds both from a user's API point of view, and from a code complexity point of view
> On Jun 23, 2025, at 23:15, "Jérôme Dockès" wrote (skrub-data/skrub#1462):
>
> as skrub has a focus on supervised learning having an optional 'target' parameter for the tablereport that causes it to show slightly different information might be a good idea, and something that @Vincent-Maladiere has brought up when it was first added to skrub
> Should it be the same object (TableReport) or a different one? And the question holds both from a user's API point of view, and from a code complexity point of view
If we choose to go in this direction (which I like a lot), my guess is that this should be the same object from a user perspective
> If we choose to go in this direction (which I like a lot), my guess is that this should be the same object from a user perspective
OK, I see the point. I do worry that it is going to make the TableReport codebase more complicated, and it's currently complicated.
In this case, what about having 2 private implementations, with a public one doing the dispatch? In the meantime, should we extend the column_association function as suggested above?
> In this case, what about having 2 private implementations, with a public one doing the dispatch?
That might be a good idea. I'm for anything that makes the code simpler to read and manage.
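For reference, the "two private implementations, one public dispatcher" idea could look roughly like the sketch below. All names here (`_all_pairs`, `_against_target`, `associations`, `agreement`) are hypothetical and not part of skrub, and `agreement` is only a stand-in score so that the example runs on its own; it is not Cramér's V.

```python
from itertools import combinations


def _all_pairs(columns, score):
    """Private path: score every unordered pair of columns."""
    rows = [
        (a, b, score(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def _against_target(columns, target, score):
    """Private path: score every other column against one target column."""
    if target not in columns:
        raise KeyError(f"unknown target column: {target!r}")
    rows = [
        (name, target, score(values, columns[target]))
        for name, values in columns.items()
        if name != target
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def associations(columns, score, target=None):
    """Public entry point: it only decides which private path to run."""
    if target is None:
        return _all_pairs(columns, score)
    return _against_target(columns, target, score)


def agreement(xs, ys):
    """Stand-in score: share of rows where both columns hold equal values."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Toy data, invented for illustration only.
table = {"a": [1, 1, 0, 0], "b": [1, 1, 0, 1], "c": [0, 0, 1, 1]}
all_pairs = associations(table, agreement)
to_target = associations(table, agreement, target="a")
```

The public function stays a thin switch on `target`, so each private path can be read and tested separately, which is the property being asked for here.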
> In the meantime, should we extend the column_association function as suggested above?
Absolutely!