skrub icon indicating copy to clipboard operation
skrub copied to clipboard

Be more explicit about the max_*_columns in the TableReport

Open MarieSacksick opened this issue 1 month ago • 8 comments

During the sprint yesterday, with @JulietteBgl we found that the attributes max_*_columns were not clear. In particular, the attribute max_plot_columns. Indeed, as the attribute is an int, we expect to have at least this number of columns plotted, while actually it behaves as a bool: plot or do not plot.
I understand the necessity of having a maximum though, to avoid unexpected computational slowness.

plotting

I see two possible solutions here.

  1. keep the API and the configuration as is, but display the first N = max_plot_columns in the table report, with a message saying that not all columns have been plotted (if relevant) to avoid misleading
  2. keep the configuration settings, to still have a max, but change the attribute from max_plot_columns (int) to plot_columns (bool). There would be no default: the columns display if nb_columns < config.max_plot_columns. If set to True, it displays everything no matter the number of columns, if set to False, it plots nothing no matter the number of columns.

I personally like option 1 best, as I write option 2 I see that it takes longer to explain --> it's confusing.

association

This one is less obvious, because it doesn't make sense to compute the association on the N first columns. This is where the option 2 above makes more sense, to have something similar between plotting and association.

Related to https://github.com/skrub-data/skrub/issues/1523

MarieSacksick avatar Oct 30 '25 16:10 MarieSacksick

thanks @MarieSacksick and @JulietteBgl

I like your option 2. The parameter could be named something like

with_plots: Boolean, whether to generate histograms or not. default is "if it's cheap".

The definition of "cheap" is a threshold on the dataframe's width (and the threshold can be configured, but most users won't care about that).

I agree that when we take manual control and pass an explicit argument we just want to say "plot" or "don't plot"; controlling that indirectly through the threshold is needlessly complex. Another advantage of option 2 is that we can show the blue message only when plotting was skipped due to the threshold; if the user passes with_plots=False we just don't include the "distributions" tab at all (something for which we need a kind of workaround relying on a private API for the dataop's reports).

I think option 1. is less interesting because

  • As you said, it only works for tasks that can be stopped halfway through -- for example it cannot be used for the associations. So it adds one thing to understand without removing any, increasing the overall complexity. Option 2 is nicer in that respect because the exact same behavior can be reused for the associations, or any other operation we may want to make optional in the future
  • "only the first X columns are shown" is not tremendously easier to understand than "if there are more than X columns, skip doing Y", and does not make it much more obvious what I need to do if I just want to force plotting or not plotting for all columns.
  • I think it will waste time. In many cases the interesting columns will not happen to be grouped in the first 30. So either I don't want any plots and the time spent plotting the first 30 is wasted, or after seeing the first report I will want to see the plot for column 31, generate a new report while forcing to make all plots, and at this point I will generate plots for the first 30 for the second time. So in both cases overall I generate 30 plots I don't need. in other words trying to do only part of the work is likely to result in doing needlessly or doing it twice, unless we are confident that we can reliably identify which part is needed (I think probably not), or unless we can cache it (adds too much complexity).
  • it adds a little bit of complexity to the report generation and the template. really not a lot, but complexity builds up over time.

jeromedockes avatar Oct 31 '25 10:10 jeromedockes

it would be great to decide on this before 0.7.0 in case the max_*_columns need to be deprecated

jeromedockes avatar Nov 10 '25 18:11 jeromedockes

Do you need input from us or is it more a question for @GaelVaroquaux and @rcap107?

MarieSacksick avatar Nov 10 '25 20:11 MarieSacksick

The current behavior of the table report (a hidden boolean flag when the number of columns is bigger than the threshold) is something that I also find quite counterintuitive.

Given proper UI, I think that plotting only the first N columns would not be particularly complicated, and I'd find the behavior far more intuitive than "if we have more than N columns, no columns are plotted at all". That said, I agree on the other issues that would be introduced by implementing solution 1.

Sidenote: I wonder if we could have some kind of workaround by adding a selector that picks only the first N columns. Like:

SelectCols(cols=s.first_n_cols(30)).fit_transform(df)

I think that solution 2 would give a clearer idea of what's happening, so I'm in favor of solution 2.

rcap107 avatar Nov 14 '25 09:11 rcap107

From IRL discussion:

We should proceed with solution 2. Some implementation details

  • Rather than max_*_columns, the parameters should be renamed something like "plot_distributions" "compute_associations" (not sure about these names yet)
  • The flags should become boolean with the addition of the string from_config (this should be more explicit in saying you get the values from the config)
  • By default, both flags are set to True, and the threshold is set in the configuration as it is now
  • If the user wants to change the number of columns to plot, this should be done by altering the parameter in the configuration, e.g., with the context manager

Aside from the names, I think this is the gist of it

rcap107 avatar Nov 17 '25 16:11 rcap107

By default, both flags are set to True, and the threshold is set in the configuration as it is now

I think "from_config" (use the threshold) should be the default. The reason those parameters were introduced is to avoid a user inadvertently doing a report for a large dataframe and getting stuck for a long time or even running out of memory

jeromedockes avatar Nov 17 '25 18:11 jeromedockes

By default, both flags are set to True, and the threshold is set in the configuration as it is now

I think "from_config" (use the threshold) should be the default. The reason those parameters were introduced is to avoid a user inadvertently doing a report for a large dataframe and getting stuck for a long time or even running out of memory

Yes you're absolutely right

rcap107 avatar Nov 17 '25 18:11 rcap107

Noting it down here to remember: when we change the names of the parameters we need to also update the messages that are shown in the html templates like column_associations.html

rcap107 avatar Nov 18 '25 15:11 rcap107