Add sort buttons to `TableReport`'s "Table" tab
Problem Description
When using the TableReport on a dataframe with a small enough number of rows (e.g. fewer than 1e5 or maybe even 1e6), it would be nice to have sort buttons in the column headers of the default "Table" panel, similar to what is done for the "Stats" panel.
Feature Description
I suppose TableReport needs to precompute those views ahead of time, which can be costly when the number of samples or columns is very large. Still, in many common use cases where people deal with dataframes with fewer than 1 million rows and a few hundred columns, that might be fast enough to keep the experience interactive.
Alternative Solutions
Alternatively, one could imagine on-demand computation when the user clicks the button instead of precomputing, but this would require tools like ipywidgets or streamlit, which is probably out of scope for skrub.
The TableReport already accepts an `order_by` parameter that allows sorting by a numerical or time column (though it has problems, see #1464).
I think the main issue with doing this interactively is that atm the report is only storing the first/last n_rows, and drawing the report around those; sorting arbitrary columns would mean pre-sorting everything, as well as keeping track of the entire table, and then redrawing dynamically every time.
From my understanding of the code, this would require rewriting a pretty large chunk of the TableReport code, though I may be wrong.
@jeromedockes will know
Hello Olivier, as @rcap107 explained, the TableReport gives a static view of a tiny subset of the data and doesn't hold a reference to the dataframe. Developing a dynamic version of the TableReport (which could look like hiplot) could bring a lot to the table (pun not intended). This could unlock a huge number of features like the sorting you mention, but that would also be a big heap of work.
As I explained in the description, it would be possible to precompute all the sorted previews (2 * n_columns). Assuming that n_columns (and n_rows) are small enough, it should be cheap to precompute all of those previews, which can then be served statically.
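To make the idea concrete, here is a minimal sketch of precomputing the 2 * n_columns previews with pandas. The helper name `precompute_sorted_previews` and its `n_preview` and `max_columns` parameters are hypothetical and not part of skrub; the sketch also assumes every column holds mutually comparable values.

```python
import pandas as pd


def precompute_sorted_previews(df, n_preview=5, max_columns=30):
    """Return {(column, direction): preview} for every column.

    Hypothetical helper, not skrub API. Each preview keeps the first and
    last ``n_preview`` rows of the table sorted by ``column``.
    """
    if df.shape[1] > max_columns:
        # Assumed guard: skip the precomputation for wide tables.
        return {}
    previews = {}
    for column in df.columns:
        for ascending in (True, False):
            ordered = df.sort_values(column, ascending=ascending)
            if len(ordered) <= 2 * n_preview:
                # Small table: the preview is the whole sorted table.
                preview = ordered
            else:
                preview = pd.concat(
                    [ordered.head(n_preview), ordered.tail(n_preview)]
                )
            direction = "asc" if ascending else "desc"
            previews[(column, direction)] = preview
    return previews


df = pd.DataFrame({"a": [3, 1, 2, 5, 4, 0], "b": list("zyxwvu")})
previews = precompute_sorted_previews(df, n_preview=2)
```

A static report could then embed one rendered table per key and switch between them when a header button is clicked, at the cost of a larger HTML page.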
Ah, I understand your point now. This could be switched on or off with a maximum column parameter. I don't see it as a priority, though. @GaelVaroquaux WDYT?
> I don't see it as a priority, though. @GaelVaroquaux WDYT?
I agree. Not a priority, but good to have :).
Let's say that if someone comes up with a good PR on this feature that does not add too much code complexity and UX complexity, we'll take it :)
it does sound very useful! because we don't need a full sort but only the first and last 5 rows, and hydrating the templates is quite fast, computation time should be fine except for very large dataframes, which would be excluded by the threshold.
one possible drawback is it would make the reports bigger (pages slower to download, using more memory in the browser or IDE)
> because we don't need a full sort but only the first and last 5 rows
It's true that polars and pandas have `top_k` / `nlargest` operators that are more efficient than full sorts (similar to `argpartition` in numpy).
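As a rough illustration of that point, the sketch below fetches the top and bottom rows with pandas `nlargest` / `nsmallest` and checks that they match the ends of a full sort; the `score` and `id` columns are made-up example data. The numpy part shows the analogous partial selection with `argpartition`, which returns the top entries in no particular order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [0.3, 0.9, 0.1, 0.7, 0.5, 0.8], "id": range(6)})
n = 2

# Partial selection: only the n extreme rows are extracted and ordered.
top = df.nlargest(n, "score")
bottom = df.nsmallest(n, "score")

# Reference result from a full descending sort.
full = df.sort_values("score", ascending=False)
top_matches = top["id"].tolist() == full.head(n)["id"].tolist()
bottom_matches = bottom["id"].tolist() == full.tail(n)["id"].tolist()[::-1]

# numpy analogue: argpartition places the n largest values in the last
# n positions without sorting the rest of the array.
values = df["score"].to_numpy()
top_idx = np.argpartition(values, -n)[-n:]
top_values = set(values[top_idx].tolist())
```

In polars, `DataFrame.top_k` plays a similar role; it is left out of the snippet to keep it to a single dependency.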