Improve the functionality of the TableReport plots when `order_by` is set
When `order_by` is set, numerical features are plotted as a function of the `order_by` column. However, the resulting plots do not make much sense (see image):
from skrub import TableReport
from skrub.datasets import fetch_medical_charge
df = fetch_medical_charge().X
TableReport(df, order_by="Average_Medicare_Payments").open()
Ideally, these should be plotted using a scatterplot, but scatterplots are very expensive to prepare for large datasets, so we can't plug one in directly.
What about subsampling the scatterplots?
I think that you are stretching the meaning of "order_by" a bit and trying to make it do something that is not related to ordering.
@GaelVaroquaux did you mean to post this in #1462?
The order_by parameter I'm talking about is already in the TableReport, and is currently broken, as shown in the issue description.
> What about subsampling the scatterplots?
It could be that, or it could be a 2d histogram (or something along those lines)
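To make the 2D-histogram option concrete, here is a minimal sketch in plain Python. It does not use any skrub internals, and `bin_2d` and its parameters are hypothetical names chosen for illustration. It bins (x, y) pairs into a fixed grid, so the amount of data to draw depends on the number of bins rather than on the number of rows.

```python
def bin_2d(xs, ys, n_bins=4):
    """Count (x, y) pairs falling in each cell of an n_bins x n_bins grid.

    Hypothetical helper for illustration; not part of skrub.
    Rows of the result index y bins, columns index x bins.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def bin_index(value, low, high):
        if high == low:  # constant column: everything goes in the first bin
            return 0
        idx = int((value - low) / (high - low) * n_bins)
        return min(idx, n_bins - 1)  # the maximum value belongs to the last bin

    counts = [[0] * n_bins for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        counts[bin_index(y, y_min, y_max)][bin_index(x, x_min, x_max)] += 1
    return counts


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 4.0]
ys = [0.0, 0.5, 2.0, 3.5, 4.0, 4.0]
counts = bin_2d(xs, ys, n_bins=2)
```

The grid of counts could then be rendered as a heatmap; how that would fit into the report's plotting code is left open here.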
I opened the issue to keep track of it and see if people agree it's worth working on. I've already given it a shot some time ago, but it's more complicated than I had first thought.
I don't understand:
- Why the above plots are broken
- Why order_by should trigger scatterplots or 2D histograms.
For the sake of communication, and to make sure that we don't implement features without coherence and a user experience in mind, it is generally important to explain the user story and the design.
Currently, the TableReport has an order_by parameter that takes a column name. If that column is numerical or a datetime, all numerical columns are plotted as functions of it.
With the current behavior, each numerical feature is sorted by the "order_by" column, and the plot function then connects all the "y" points as a sequence, following the x values in the "order_by" column.
To me, this does not make a lot of sense because it feels like I am adding an order that isn't there in the original data, and all the connections between the dots are just adding noise.
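As a toy illustration of why the connecting lines add noise (plain Python on made-up numbers, not the TableReport's actual plotting code): after sorting by the order_by values, a line plot joins rows that are merely neighbours along x, even when their y values are unrelated.

```python
# Toy data: two rows have almost the same x but very different y.
order_by = [3.0, 1.0, 2.0, 2.01]
feature = [10.0, 50.0, 0.0, 100.0]

# Sort the feature by the order_by column, as described above.
pairs = sorted(zip(order_by, feature))
sorted_x = [x for x, _ in pairs]
sorted_y = [y for _, y in pairs]

# A line plot joins consecutive points, so each segment is a vertical jump
# between rows that just happen to be adjacent along x.
jumps = [abs(b - a) for a, b in zip(sorted_y, sorted_y[1:])]
```

Here the segment between x = 2.0 and x = 2.01 spans the full range of the feature, which is the kind of spike that dominates the plots in the issue description.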
Instead, I would rather have a scatterplot to just show the distribution of points as a function of the order_by column like the examples below:
Whether it's a scatterplot, a 2d histogram or a KDE plot is secondary, to me what is important is not having a line plot when it's unclear if there is an order relationship between the dots.
A much bigger problem I found while testing out the different plots is that setting order_by in a large dataset causes the TableReport to take an extremely long time to run, so regardless of the solution we go for, we will have to address that issue in some way.
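If the cost is dominated by the number of points being plotted (an assumption; I have not profiled where the time actually goes), one way to bound it would be to subsample rows before plotting. A minimal sketch, where `subsample_rows`, `max_points` and `seed` are hypothetical names and not existing TableReport parameters:

```python
import random


def subsample_rows(n_rows, max_points=1000, seed=0):
    """Return sorted row indices to plot, at most max_points of them.

    Hypothetical helper for illustration; not part of skrub.
    """
    if n_rows <= max_points:
        return list(range(n_rows))
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    return sorted(rng.sample(range(n_rows), max_points))


indices = subsample_rows(1_000_000, max_points=1000)
```

Uniform subsampling keeps the overall shape of the distribution but can drop rare outliers, which is one reason a binned plot may still be preferable.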
I think the original rationale for order_by was to plot time series, which would have been a histogram otherwise. However, I agree with Gael that using it for 2D plots seems like a stretch and would require properly handling 2D distributions.
Perhaps out of scope, but later we could pass a target column to the TableReport and plot distributions between the features and this target, as well as compute associations only against it.
Now I get the reasoning. I am fine with shelving the 2D plot idea, but the problem remains that building the report for datasets that are not even that large takes a very long time when order_by is set to anything, and I think we should address at least that part.
As @Vincent-Maladiere says, order_by is something that was kept from early versions of the TableReport (skrubview at the time) for time series. As it's not really documented, I doubt anyone has used it, and if it is causing problems maybe the easiest thing is to deprecate it.
Indeed, specifying a target column, showing plots that take it into account, and only computing correlations / Cramér associations with the target sounds more useful.
> it's not really documented I doubt anyone used it and if it is causing problems maybe the easiest thing is to deprecate it.
OK, I see two ways forward:
1. Deprecate and remove
2. Clarify the use case and semantics, and work from here on improving the functionality
I'm leaning towards option 2, because having fast time series plots is an excellent feature IMO
Going back to this issue:
If we go with option 2 (clarify the use case and semantics, and work from here on improving the functionality), should we disable order_by for non-datetime columns? We could either plot the default distribution or not plot at all.
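For discussion, the "fall back to the default distribution" variant could look roughly like the sketch below. It assumes a pandas dataframe, and `plot_kind` is a hypothetical helper rather than anything in skrub's API; the only real library call is `pd.api.types.is_datetime64_any_dtype`.

```python
import pandas as pd


def plot_kind(df, column, order_by=None):
    """Decide how `column` would be plotted. Hypothetical helper, not skrub's API."""
    if order_by is None:
        return "distribution"
    if pd.api.types.is_datetime64_any_dtype(df[order_by]):
        return "line"  # time series: plot the column against the datetime
    return "distribution"  # non-datetime order_by: ignore it for plotting


df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
        "payment": [10.0, 12.5, 9.0],
        "charge": [100.0, 90.0, 110.0],
    }
)
```

With this rule, `plot_kind(df, "charge", order_by="when")` gives a line plot, while `order_by="payment"` silently falls back to the default plot. Whether a silent fallback or a warning is better is part of the question above.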