FIX - TableReport behavior when `max_plot_columns` is set to `None`
- how do I set
max_plot_columnsto always plot? I would expect setting it to None to mean "no maximum" but it still uses the default 30 - similarly (less important) there does not seem to be a way to set subsampling seed to None. maybe not a big deal since this is risky due to the possibility of subsampling several items separately
- BUG: setting
enable_subsamplingto "force" will also subsample even duringpredict; we never want subsampling in that case (y_scorewould have a different shape thany_true)
- how do I set
max_plot_columnsto always plot? I would expect setting it to None to mean "no maximum" but it still uses the default 30
hah, that's the thing, you don't
_check_max_cols in _table_report.py is replacing None with the value, but then this means that in _summary() the value is never None, so that's also a bug that must be fixed
At the moment, None means "use the default", I am wondering if we should change the behavior of the None, or add a -1 special case for "plot all the columns"
- BUG: setting
enable_subsamplingto "force" will also subsample even duringpredict; we never want subsampling in that case (y_scorewould have a different shape thany_true)
For this one, something like this should do the trick, no? 🤔
def _should_subsample(mode, environment):
enable_subsampling = get_config()["enable_subsampling"]
if enable_subsampling == "disable":
return False
if enable_subsampling == "force" and mode != "predict":
return True
if mode == "preview":
return True
if "fit" not in mode:
return False
return environment.get(SHOULD_SUBSAMPLE_KEY, False)
it should be "if 'fit' not in mode" as you can see on the last if statement -- there are other modes like "score" or even "transform" where there must be no subsampling so the correct check is to see if the mode is in ["fit", "fit_transform"]. once this is fixed it is still a bit strange that the global config takes precedence over the parameter passed to the function, but I guess that is ok as the default global config is "default"
so it would be something like
def _should_subsample(mode, environment):
enable_subsampling = get_config()["enable_subsampling"]
if "fit" not in mode:
return False
if enable_subsampling == "disable":
return False
if mode == "preview":
return True
if enable_subsampling == "force":
return True
return environment.get(SHOULD_SUBSAMPLE_KEY, False)
this still doesn't plot regardless of the number of columns:
import skrub
import polars as pl
df = pl.DataFrame({f'col_{i}': list(range(10)) for i in range(50)})
skrub.TableReport(df, max_plot_columns=None).open()
IMO "no maximum" is the intuitive interpretation of max=None, and the doc does not say what happens when the argument is None (the default)
To fully close this issue, setting max_plot_columns=None and max_association_columns=None should result in plotting/computing the associations for all columns.
This should be done in the table report, where summarization is done.
I'll take that!
thanks @JulietteBgl!
For distinguishing between the default and None I suggest adding some sentinel values like this in the config or utils module:
class Config(enum.Enum):
max_plot_columns = enum.auto()
max_association_columns = enum.auto()
def __repr__(self):
return f"skrub.get_config()['{self.name}']"
def __str__(self):
return self.__repr__()
then the default can be set to Config.max_plot_columns which is clearer than None both in the source code and in the html docs where it is rendered like this:
the check to resolve the max_plot_columns to an actual number could then look something like this:
if max_plot_columns is _config.Config.max_plot_columns:
max_plot_columns = _config.get_config()["max_plot_columns"]
elif max_plot_columns is None:
max_plot_columns = float('inf')
else:
max_plot_columns = int(max_plot_columns)
Note ATM there is a circular import between the _reporting and _config modules, as just importing _config patches the pandas and polars displays, so the enum would probably have to go in skrub._utils
actually I don't think it is a good idea to do something so intrusive as actually patching pandas and polars just upon importing skrub, especially as the config key is very generic ("use_table_report") it is surprising that just importing skrub would modify the behavior of pandas. so it might be better to keep this out of config and in the patch_display_functions (I realize I said the opposite in the original issue for adding the config). but in any case that would be discussed in a separate issue @rcap107
actually I don't think it is a good idea to do something so intrusive as actually patching pandas and polars just upon importing skrub,
+1