Improve the functionality of the TableReport plots when `order_by` is set

Open rcap107 opened this issue 6 months ago • 13 comments

When `order_by` is set, numerical features are plotted as a function of the `order_by` column. However, the resulting plots do not make much sense (see image):

from skrub import TableReport
from skrub.datasets import fetch_medical_charge
df = fetch_medical_charge().X
TableReport(df, order_by="Average_Medicare_Payments").open()

Image

Ideally, these should be plotted using a scatterplot, but scatterplots are very expensive to prepare for large datasets, so we can't plug one in directly.

rcap107 avatar Jun 23 '25 12:06 rcap107

What about subsampling the scatterplots?

Vincent-Maladiere avatar Jun 23 '25 14:06 Vincent-Maladiere

I think that you are pushing a bit the meaning of "order_by" and trying to make it do something that is not related to ordering.

GaelVaroquaux avatar Jun 23 '25 14:06 GaelVaroquaux

> I think that you are pushing a bit the meaning of "order_by" and trying to make it do something that is not related to ordering.

@GaelVaroquaux did you mean to post this in #1462?

The `order_by` parameter I'm talking about is already in the TableReport, and it is currently broken, as shown.

rcap107 avatar Jun 23 '25 14:06 rcap107

> What about subsampling the scatterplots?

It could be that, or it could be a 2d histogram (or something along those lines)

I opened the issue to keep track of it and see if people agree it's worth working on. I've already given it a shot some time ago, but it's more complicated than I had first thought.
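
To make the two options concrete, here is a minimal sketch using plain NumPy. It is not skrub's implementation: the helper names `subsample_xy` and `binned_counts`, the `max_points` cap, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def subsample_xy(x, y, max_points=5_000, seed=0):
    """Hypothetical helper: keep at most `max_points` (x, y) pairs,
    drawn uniformly without replacement."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = x.shape[0]
    if n <= max_points:
        return x, y
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=max_points, replace=False)
    return x[idx], y[idx]


def binned_counts(x, y, bins=50):
    """Hypothetical helper: 2D histogram of all points.
    The result has a fixed size (bins x bins) whatever the number of rows."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    return counts, x_edges, y_edges


# Synthetic stand-in for a large table with two numerical columns.
rng = np.random.default_rng(42)
x = rng.lognormal(size=200_000)
y = 0.5 * x + rng.normal(size=200_000)

xs, ys = subsample_xy(x, y)                      # option 1: fewer points to draw
counts, x_edges, y_edges = binned_counts(x, y)   # option 2: aggregate before drawing
```

Either way, the amount of data handed to the plotting code is bounded: 5,000 points for the subsample, or a 50 x 50 grid of counts for the histogram.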

rcap107 avatar Jun 23 '25 14:06 rcap107

I don't understand:

  1. Why the above plots are broken
  2. Why order_by should trigger scatterplots or 2D histograms.

For the sake of communication, and to make sure that we don't implement features without coherence and a user experience in mind, it is important in general to explain the user story and the design.

GaelVaroquaux avatar Jun 23 '25 14:06 GaelVaroquaux

Currently, the TableReport has the order_by parameter that can take a column, and that plots all numerical columns as functions of the given column if it is numerical or a datetime.

With the current parameters, each numerical feature is sorted by the "order_by" column, then the plot function is trying to connect all the "y" points as a sequence according to the x in the "order_by" column.
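
The behavior described above can be reproduced outside the TableReport with a small sketch. This is an illustration with synthetic data, not skrub's plotting code; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=1_000)  # stand-in for the order_by column
y = rng.normal(0, 1, size=1_000)     # a numerical feature unrelated to x

# Sort both columns by the order_by values.
order = np.argsort(x, kind="stable")
x_sorted, y_sorted = x[order], y[order]

# A line plot joins each point to the next one in x order, so the line
# travels |y[i+1] - y[i]| vertically at every step.
jumps = np.abs(np.diff(y_sorted))
print(f"mean jump between neighbours: {jumps.mean():.2f}")
print(f"standard deviation of y:      {y.std():.2f}")
```

When the feature has no sequential relationship with the `order_by` column, the typical jump between neighbours is about as large as the overall spread of the feature, so the connecting segments mostly draw noise.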

Image

To me, this does not make a lot of sense because it feels like I am adding an order that isn't there in the original data, and all the connections between the dots are just adding noise.

Instead, I would rather have a scatterplot to just show the distribution of points as a function of the order_by column like the examples below:

Image

Whether it's a scatterplot, a 2d histogram or a KDE plot is secondary. To me, what is important is not having a line plot when it's unclear whether there is an order relationship between the dots.

rcap107 avatar Jun 23 '25 15:06 rcap107

A much bigger problem I found while testing out the different plots is that setting order_by in a large dataset causes the TableReport to take an extremely long time to run, so regardless of the solution we go for, we will have to address that issue in some way.

rcap107 avatar Jun 23 '25 15:06 rcap107

I think the original rationale for order_by was to plot time series, which would have been a histogram otherwise. However, I agree with Gael that using it for 2D plots seems like a stretch and would require properly handling 2D distributions.

Perhaps out of scope, but later we could pass a target column to the TableReport and plot distributions between the features and this target, as well as compute associations only against it.

Vincent-Maladiere avatar Jun 23 '25 16:06 Vincent-Maladiere

Now I get the reasoning. I am fine with shelving the 2d plot idea, but the problem remains that plotting even moderately sized datasets with order_by set to anything takes an extremely long time, and I think we should address that part at least.

rcap107 avatar Jun 23 '25 16:06 rcap107

As @Vincent-Maladiere says, order_by is something that was kept from early versions of the TableReport (skrubview at the time) for time series. As it's not really documented, I doubt anyone used it, and if it is causing problems, maybe the easiest thing is to deprecate it.

Indeed, specifying a target column, showing plots that take it into account, and only computing correlations / Cramér associations with the target sounds more useful.
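
As a rough sketch of the "associations against the target only" idea, assuming a pandas dataframe and a numerical target: the helper `correlations_with_target` is hypothetical, it is not part of skrub, and it covers only Pearson correlation, not Cramér's V.

```python
import numpy as np
import pandas as pd


def correlations_with_target(df, target):
    """Hypothetical helper: Pearson correlation of every numerical
    feature with a numerical `target` column (d values, not d x d)."""
    numeric = df.select_dtypes(include="number")
    features = numeric.drop(columns=[target])
    return features.corrwith(numeric[target])


rng = np.random.default_rng(0)
t = rng.normal(size=200)
df = pd.DataFrame(
    {
        "target": t,
        "linear": 2 * t + 1,             # perfectly correlated with the target
        "noise": rng.normal(size=200),   # unrelated to the target
        "label": ["a"] * 200,            # non-numerical, ignored here
    }
)

corr = correlations_with_target(df, "target")
print(corr)
```

The cost grows linearly with the number of columns, instead of quadratically as when every pair of columns is compared.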

jeromedockes avatar Jun 23 '25 20:06 jeromedockes

> it's not really documented I doubt anyone used it and if it is causing problems maybe the easiest thing is to deprecate it.

OK, I see two ways forward:

  1. Deprecate and remove

  2. Clarify the use case and semantics, and work from here on improving the functionality

GaelVaroquaux avatar Jun 24 '25 06:06 GaelVaroquaux

I'm leaning towards option 2, because having fast time series plots is an excellent feature IMO

Vincent-Maladiere avatar Jun 24 '25 08:06 Vincent-Maladiere

Going back to this issue:

If we go with option 2 (clarify the use case and semantics, and work from here on improving the functionality), should we disable order_by for non-datetime columns? In that case we would either plot the default distribution or not plot at all.
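
One possible shape for that rule, as a sketch only: it assumes a pandas dataframe, and the helper `choose_plot_kind` and its return values are hypothetical, not an existing skrub API. It implements the "fall back to the default distribution" variant.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def choose_plot_kind(df, order_by=None):
    """Hypothetical helper: decide how numerical columns get plotted.

    Line plots are used only when `order_by` names a datetime column;
    anything else falls back to the default distribution plot.
    """
    if order_by is not None and is_datetime64_any_dtype(df[order_by]):
        return "line"
    return "histogram"


df = pd.DataFrame(
    {
        "when": pd.date_range("2025-01-01", periods=4, freq="D"),
        "payment": [10.0, 12.5, 9.0, 11.0],
        "charge": [100.0, 130.0, 95.0, 120.0],
    }
)

print(choose_plot_kind(df, order_by="when"))     # datetime column
print(choose_plot_kind(df, order_by="payment"))  # numerical column
```

The "not plotting at all" variant would replace the fallback with raising an error or emitting a warning when `order_by` is not a datetime column.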

rcap107 avatar Oct 10 '25 08:10 rcap107