skrub icon indicating copy to clipboard operation
skrub copied to clipboard

FIX - TableReport behavior when `max_plot_columns` is set to `None`

Open jeromedockes opened this issue 5 months ago • 10 comments

  • how do I set max_plot_columns to always plot? I would expect setting it to None to mean "no maximum" but it still uses the default 30
  • similarly (less important) there does not seem to be a way to set subsampling seed to None. maybe not a big deal since this is risky due to the possibility of subsampling several items separately
  • BUG: setting enable_subsampling to "force" will also subsample even during predict; we never want subsampling in that case (y_score would have a different shape than y_true)

jeromedockes avatar Jul 17 '25 11:07 jeromedockes

  • how do I set max_plot_columns to always plot? I would expect setting it to None to mean "no maximum" but it still uses the default 30

hah, that's the thing, you don't

_check_max_cols in _table_report.py is replacing None with the value, but then this means that in _summary() the value is never None, so that's also a bug that must be fixed

At the moment, None means "use the default", I am wondering if we should change the behavior of the None, or add a -1 special case for "plot all the columns"

rcap107 avatar Jul 17 '25 13:07 rcap107

  • BUG: setting enable_subsampling to "force" will also subsample even during predict; we never want subsampling in that case (y_score would have a different shape than y_true)

For this one, something like this should do the trick, no? 🤔

def _should_subsample(mode, environment):
    enable_subsampling = get_config()["enable_subsampling"]
    if enable_subsampling == "disable":
        return False
    if enable_subsampling == "force" and mode != "predict":
        return True
    if mode == "preview":
        return True
    if "fit" not in mode:
        return False
    return environment.get(SHOULD_SUBSAMPLE_KEY, False)

rcap107 avatar Jul 17 '25 13:07 rcap107

it should be "if 'fit' not in mode" as you can see on the last if statement -- there are other modes like "score" or even "transform" where there must be no subsampling so the correct check is to see if the mode is in ["fit", "fit_transform"]. once this is fixed it is still a bit strange that the global config takes precedence over the parameter passed to the function, but I guess that is ok as the default global config is "default"

jeromedockes avatar Jul 17 '25 13:07 jeromedockes

so it would be something like

def _should_subsample(mode, environment):
    enable_subsampling = get_config()["enable_subsampling"]
    if "fit" not in mode:
        return False
    if enable_subsampling == "disable":
        return False
    if mode == "preview":
        return True
    if enable_subsampling == "force":
        return True
    return environment.get(SHOULD_SUBSAMPLE_KEY, False)

jeromedockes avatar Jul 17 '25 13:07 jeromedockes

this still doesn't plot regardless of the number of columns:

import skrub
import polars as pl

df = pl.DataFrame({f'col_{i}': list(range(10)) for i in range(50)})
skrub.TableReport(df, max_plot_columns=None).open()

IMO "no maximum" is the intuitive interpretation of max=None, and the doc does not say what happens when the argument is None (the default)

jeromedockes avatar Sep 15 '25 06:09 jeromedockes

To fully close this issue, setting max_plot_columns=None and max_association_columns=None should result in plotting/computing the associations for all columns.

This should be done in the table report, where summarization is done.

rcap107 avatar Oct 27 '25 14:10 rcap107

I'll take that!

JulietteBgl avatar Oct 29 '25 10:10 JulietteBgl

thanks @JulietteBgl!

For distinguishing between the default and None I suggest adding some sentinel values like this in the config or utils module:

class Config(enum.Enum):
    max_plot_columns = enum.auto()
    max_association_columns = enum.auto()

    def __repr__(self):
        return f"skrub.get_config()['{self.name}']"

    def __str__(self):
        return self.__repr__()

then the default can be set to Config.max_plot_columns which is clearer than None both in the source code and in the html docs where it is rendered like this:

Image

the check to resolve the max_plot_columns to an actual number could then look something like this:

    if max_plot_columns is _config.Config.max_plot_columns:
        max_plot_columns = _config.get_config()["max_plot_columns"]
    elif max_plot_columns is None:
        max_plot_columns = float('inf')
    else:
        max_plot_columns = int(max_plot_columns)

jeromedockes avatar Oct 29 '25 10:10 jeromedockes

Note ATM there is a circular import between the _reporting and _config modules, as just importing _config patches the pandas and polars displays, so the enum would probably have to go in skrub._utils

actually I don't think it is a good idea to do something so intrusive as actually patching pandas and polars just upon importing skrub, especially as the config key is very generic ("use_table_report") it is surprising that just importing skrub would modify the behavior of pandas. so it might be better to keep this out of config and in the patch_display_functions (I realize I said the opposite in the original issue for adding the config). but in any case that would be discussed in a separate issue @rcap107

jeromedockes avatar Oct 29 '25 10:10 jeromedockes

actually I don't think it is a good idea to do something so intrusive as actually patching pandas and polars just upon importing skrub,

+1

GaelVaroquaux avatar Oct 29 '25 13:10 GaelVaroquaux