skrub
Misc improvements of table report
Keeping track of a few things we discussed during and after EuroSciPy:
functionality
- [x] show the pandas index
- [x] arrow key navigation in table #1065
- [x] sortable 'stats' table #1068
- [x] on the summaries tab panel, start with the column selection empty. provide a tooltip saying to select some columns #1072
- [x] on the table and summaries tab, make the copy buttons always visible (not just on hover) ?
- [x] smaller text at least in tables?
- [ ] add plots for column associations
- [ ] on the bar plot with most frequent values, percentages are shown on the bar but it is easy to miss that the actual count is available by reading the x-axis ticklabels. maybe add the count on the bar but there isn't a lot of space, or make the x axis more salient somehow?
- [x] log scale in histograms?: we added outlier detection
- [x] option to show more rows in sample table?
- [x] make sure not to capture events that have a modifier key down in the table or tabbed interface #1065
- [x] easier copying of table cell contents #1048
- [x] make sure the placeholder text and top bar in the table take the same vertical space #1058
- [x] replace the clipboard icon with 2 overlapping squares #1061
- [x] remove drop-down options from copyable text boxes #1058
- [x] add summary stats panel #1056
- [x] do not threshold column associations, always show top 20 #1060
- [x] thousands separator in dataframe shape display #1059
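
For the thousands separator item, here is a minimal sketch of the kind of formatting involved. The `format_shape` helper and its output wording are hypothetical, for illustration only, and are not taken from the skrub code base; the sketch relies only on Python's built-in `,` format specifier.

```python
def format_shape(n_rows, n_columns):
    # Hypothetical helper (not skrub's implementation): the "," option in the
    # format spec inserts a comma as thousands separator.
    return f"{n_rows:,} rows × {n_columns:,} columns"


shape_text = format_shape(1234567, 42)
print(shape_text)
```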
appearance
- [x] ~make tables not striped (alternating gray levels for rows), less contrast in gray levels, some styling of interactive elements?~ (we decided not to do it)
- [x] ~borders instead of shadows around cards?~ (we decided not to do it)
python interface / long-term
- [ ] have a way to say a column is the target or somehow special and adapt the display
Also, the TableReport returns an error when the dataframe contains a list:
import pandas as pd
from skrub import TableReport
TableReport(
    pd.DataFrame(dict(a=[[1]]))
)
Traceback
"name": "TypeError",
"message": "unhashable type: 'list'",
"stack": "---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/dev/inria/skrub/.venv/lib/python3.10/site-packages/IPython/core/formatters.py:347, in BaseFormatter.__call__(self, obj)
345 method = get_real_method(obj, self.print_method)
346 if method is not None:
--> 347 return method()
348 return None
349 else:
File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:167, in TableReport._repr_html_(self)
166 def _repr_html_(self):
--> 167 return self._repr_mimebundle_()[\"text/html\"]
File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:164, in TableReport._repr_mimebundle_(***failed resolving arguments***)
162 def _repr_mimebundle_(self, include=None, exclude=None):
163 del include, exclude
--> 164 return {\"text/html\": self.html_snippet()}
File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:143, in TableReport.html_snippet(self)
134 def html_snippet(self):
135 \"\"\"Get the report as an HTML fragment that can be inserted in a page.
136
137 Returns
(...)
140 The HTML snippet.
141 \"\"\"
142 return to_html(
--> 143 self._summary_with_plots,
144 standalone=False,
145 column_filters=self.column_filters,
146 )
File ~/miniforge3/lib/python3.10/functools.py:981, in cached_property.__get__(self, instance, owner)
979 val = cache.get(self.attrname, _NOT_FOUND)
980 if val is _NOT_FOUND:
--> 981 val = self.func(instance)
982 try:
983 cache[self.attrname] = val
File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:104, in TableReport._summary_with_plots(self)
102 @functools.cached_property
103 def _summary_with_plots(self):
--> 104 return summarize_dataframe(
105 self.dataframe, with_plots=True, title=self.title, **self._summary_kwargs
106 )
File ~/dev/inria/skrub/skrub/_reporting/_summarize.py:64, in summarize_dataframe(df, order_by, with_plots, title)
59 for position, column_name in enumerate(sbd.column_names(df)):
60 print(
61 f\"Processing column {position + 1: >3} / {n_columns}\", end=\"\\r\", flush=True
62 )
63 summary[\"columns\"].append(
---> 64 _summarize_column(
65 sbd.col(df, column_name),
66 position,
67 dataframe_summary=summary,
68 with_plots=with_plots,
69 order_by_column=None if order_by is None else sbd.col(df, order_by),
70 )
71 )
72 print(flush=True)
73 summary[\"n_constant_columns\"] = sum(
74 c[\"value_is_constant\"] for c in summary[\"columns\"]
75 )
File ~/dev/inria/skrub/skrub/_reporting/_summarize.py:104, in _summarize_column(column, position, dataframe_summary, with_plots, order_by_column)
102 summary[\"plot_names\"] = []
103 return summary
--> 104 _add_value_counts(
105 summary, column, dataframe_summary=dataframe_summary, with_plots=with_plots
106 )
107 _add_numeric_summary(
108 summary,
109 column,
(...)
112 order_by_column=order_by_column,
113 )
114 _add_datetime_summary(summary, column, with_plots=with_plots)
File ~/dev/inria/skrub/skrub/_reporting/_summarize.py:136, in _add_value_counts(summary, column, dataframe_summary, with_plots)
134 summary[\"high_cardinality\"] = True
135 return
--> 136 n_unique, value_counts = _utils.top_k_value_counts(column, k=10)
137 # if the column contains all nulls, _add_value_counts does not get called
138 assert n_unique > 0
File ~/dev/inria/skrub/skrub/_reporting/_utils.py:48, in top_k_value_counts(column, k)
46 counts = sbd.sort(counts, by=\"count\", descending=True)
47 counts = sbd.slice(counts, k)
---> 48 return n_unique, dict(zip(*to_dict(counts).values()))
TypeError: unhashable type: 'list'"
> Also, the TableReport returns an error when the dataframe contains a list:
Thanks for reporting it. It happens because the unique value counts are stored in a dict, with the column values as keys, and in this case the value (a list) is not hashable. I'll open a separate issue for that.
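
To illustrate the failure mode without skrub, here is a minimal sketch in plain Python. `value_counts_as_dict` mirrors the `dict(zip(...))` pattern visible at the bottom of the traceback. `safe_value_counts` is a hypothetical workaround, not the fix that skrub adopted: it falls back to `repr` for values that cannot be hashed.

```python
def value_counts_as_dict(values, counts):
    # Same pattern as in the traceback: the column values become dict keys,
    # so every value must be hashable.
    return dict(zip(values, counts))


def _hashable_key(value):
    # Hypothetical helper (an assumption, not skrub's fix): keep hashable
    # values as they are, replace unhashable ones by their repr.
    try:
        hash(value)
        return value
    except TypeError:
        return repr(value)


def safe_value_counts(values, counts):
    return dict(zip((_hashable_key(v) for v in values), counts))


try:
    value_counts_as_dict([[1]], [1])
    error_message = None
except TypeError as e:
    # A list cannot be used as a dict key.
    error_message = str(e)

result = safe_value_counts([[1], "a"], [1, 2])
print(error_message)
print(result)
```

One caveat of this fallback: distinct objects with the same `repr` collapse into a single key, so it is only suitable for display purposes.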
For the appearance, my vote goes to striped table, shadows on cards, no selected cell outline. If there's a poll somewhere I'm happy to share it there