Jérôme Dockès
I think it is a different issue. BTW #2025 introduced a bug that we haven't had time to fix yet
To improve the documentation, I think there may be a misunderstanding that should be addressed early on in the introduction of `subsample`. why did you expect that taking a sample...
> detecting when one of X and y is sampled, but the other isn't

I'm not sure this would be easy to detect reliably, as subsampling can be used anywhere....
thanks for explaining. indeed that should be clarified. maybe something like "it is similar to pd.DataFrame.sample() except that it becomes a no-op when `keep_subsampling=False`" would help
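To make the suggested wording concrete, here is a minimal sketch of the behaviour it describes, using plain pandas. `maybe_sample` and its parameters are hypothetical names made up for illustration; this is not skrub's API or implementation, only the "sample, or do nothing" idea.

```python
import pandas as pd


def maybe_sample(df, n, keep_subsampling, random_state=0):
    """Hypothetical helper (not skrub's API): sample only when asked to."""
    if not keep_subsampling:
        # No-op: the full dataframe passes through unchanged.
        return df
    # Otherwise behave like pd.DataFrame.sample, capped at the table size.
    return df.sample(n=min(n, len(df)), random_state=random_state)


df = pd.DataFrame({"a": range(100)})
preview = maybe_sample(df, n=10, keep_subsampling=True)   # 10 rows
full = maybe_sample(df, n=10, keep_subsampling=False)     # all 100 rows
```

With `keep_subsampling=False` the call returns the very same object, which is the point the documentation would need to make early on.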
> I'm not sure this would be easy to detect reliably, as subsampling can be used anywhere.

actually I'm not sure why I said that; something like #1465 should cover...
yes ATM it does not check cardinality for numeric columns. we could do it for integers and replace the histogram with a bar plot when the cardinality is low
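A rough sketch of what that check could look like, in plain Python. The function name `choose_plot` and the `max_cardinality=10` threshold are assumptions picked for illustration, not what the report does today.

```python
from collections import Counter


def choose_plot(values, max_cardinality=10):
    """Pick a plot kind for a numeric column (hypothetical helper).

    Integer columns with few distinct values get a bar plot of value
    counts; everything else falls back to a histogram.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_integer = all(
        isinstance(v, int) and not isinstance(v, bool) for v in values
    )
    counts = Counter(values)
    if is_integer and len(counts) <= max_cardinality:
        # Return counts ordered by value, ready to draw as bars.
        return "bar", dict(sorted(counts.items()))
    return "histogram", None


kind, counts = choose_plot([1, 2, 2, 3, 3, 3])
float_kind, _ = choose_plot([0.5, 1.2, 3.3])
wide_kind, _ = choose_plot(list(range(100)))
```

Floats always keep the histogram here, which matches the "do it for integers" part of the suggestion.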
I agree this would be useful; see also https://github.com/skrub-data/skrub/discussions/909#discussioncomment-9571065
> Very interesting! Could we list somewhere the kind of visualizations we would like to see for some objects?

yes let's make a list and do a sprint :)
yes, it is, thanks @Neilblaze!! I think the simplest thing would be to manually check the number of rows, columns, and size on disk and write that in the...
it does sound very useful! because we don't need a full sort but only the first and last 5 rows, and hydrating the templates is quite fast, computation time should...
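To illustrate the "no full sort" point, a small sketch with the standard library's `heapq`, which selects the k smallest and k largest items without sorting everything. `first_and_last` is a hypothetical name and the data is made up; how this would plug into the report is not shown here.

```python
import heapq


def first_and_last(values, k=5):
    """Return the k smallest and k largest values without a full sort."""
    smallest = heapq.nsmallest(k, values)  # ascending order
    largest = heapq.nlargest(k, values)    # descending order
    return smallest, largest


values = [17, 3, 42, 8, 25, 1, 99, 56, 12, 7, 64, 30]
smallest, largest = first_and_last(values)
```

For rows rather than scalars, both `heapq` functions accept a `key=` argument to select on one column.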