Misc improvements of table report

Open jeromedockes opened this issue 1 year ago • 3 comments

Keeping track of a few things we discussed during and after EuroSciPy:

functionality

  • [x] show the pandas index
  • [x] arrow key navigation in table #1065
  • [x] sortable 'stats' table #1068
  • [x] on the summaries tab panel, start with the column selection empty. provide a tooltip saying to select some columns #1072
  • [x] on the table and summaries tab, make the copy buttons always visible (not just on hover) ?
  • [x] smaller text at least in tables?
  • [ ] add plots for column associations
  • [ ] on the bar plot with most frequent values, percentages are shown on the bar but it is easy to miss that the actual count is available by reading the x-axis ticklabels. maybe add the count on the bar but there isn't a lot of space, or make the x axis more salient somehow?
  • [x] log scale in histograms?: we added outlier detection
  • [x] option to show more rows in sample table?
  • [x] make sure not to capture events that have a modifier key down in the table or tabbed interface #1065
  • [x] easier copying of table cell contents #1048
  • [x] make sure the placeholder text and top bar in the table take the same vertical space #1058
  • [x] replace the clipboard icon with 2 overlapping squares #1061
  • [x] remove drop-down options from copyable text boxes #1058
  • [x] add summary stats panel #1056
  • [x] do not threshold column associations, always show top 20 #1060
  • [x] thousands separator in dataframe shape display #1059

appearance

  • [x] ~make tables not striped (alternating gray levels for rows), less contrast in gray levels, some styling of interactive elements?~ (we decided not to do it)
  • [x] ~borders instead of shadows around cards?~ (we decided not to do it)

python interface / long-term

  • [ ] have a way to say a column is the target or somehow special and adapt the display

jeromedockes, Sep 03 '24 13:09

Also, the TableReport returns an error when the dataframe contains a list:

import pandas as pd
from skrub import TableReport

TableReport(
    pd.DataFrame(dict(a=[[1]]))
)
Traceback
	"name": "TypeError",
	"message": "unhashable type: 'list'",
	"stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/dev/inria/skrub/.venv/lib/python3.10/site-packages/IPython/core/formatters.py:347, in BaseFormatter.__call__(self, obj)
    345     method = get_real_method(obj, self.print_method)
    346     if method is not None:
--> 347         return method()
    348     return None
    349 else:

File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:167, in TableReport._repr_html_(self)
    166 def _repr_html_(self):
--> 167     return self._repr_mimebundle_()[\"text/html\"]

File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:164, in TableReport._repr_mimebundle_(***failed resolving arguments***)
    162 def _repr_mimebundle_(self, include=None, exclude=None):
    163     del include, exclude
--> 164     return {\"text/html\": self.html_snippet()}

File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:143, in TableReport.html_snippet(self)
    134 def html_snippet(self):
    135     \"\"\"Get the report as an HTML fragment that can be inserted in a page.
    136 
    137     Returns
   (...)
    140         The HTML snippet.
    141     \"\"\"
    142     return to_html(
--> 143         self._summary_with_plots,
    144         standalone=False,
    145         column_filters=self.column_filters,
    146     )

File ~/miniforge3/lib/python3.10/functools.py:981, in cached_property.__get__(self, instance, owner)
    979 val = cache.get(self.attrname, _NOT_FOUND)
    980 if val is _NOT_FOUND:
--> 981     val = self.func(instance)
    982     try:
    983         cache[self.attrname] = val

File ~/dev/inria/skrub/skrub/_reporting/_table_report.py:104, in TableReport._summary_with_plots(self)
    102 @functools.cached_property
    103 def _summary_with_plots(self):
--> 104     return summarize_dataframe(
    105         self.dataframe, with_plots=True, title=self.title, **self._summary_kwargs
    106     )

File ~/dev/inria/skrub/skrub/_reporting/_summarize.py:64, in summarize_dataframe(df, order_by, with_plots, title)
     59 for position, column_name in enumerate(sbd.column_names(df)):
     60     print(
     61         f\"Processing column {position + 1: >3} / {n_columns}\", end=\"\\r\", flush=True
     62     )
     63     summary[\"columns\"].append(
---> 64         _summarize_column(
     65             sbd.col(df, column_name),
     66             position,
     67             dataframe_summary=summary,
     68             with_plots=with_plots,
     69             order_by_column=None if order_by is None else sbd.col(df, order_by),
     70         )
     71     )
     72 print(flush=True)
     73 summary[\"n_constant_columns\"] = sum(
     74     c[\"value_is_constant\"] for c in summary[\"columns\"]
     75 )

File ~/dev/inria/skrub/skrub/_reporting/_summarize.py:104, in _summarize_column(column, position, dataframe_summary, with_plots, order_by_column)
    102     summary[\"plot_names\"] = []
    103     return summary
--> 104 _add_value_counts(
    105     summary, column, dataframe_summary=dataframe_summary, with_plots=with_plots
    106 )
    107 _add_numeric_summary(
    108     summary,
    109     column,
   (...)
    112     order_by_column=order_by_column,
    113 )
    114 _add_datetime_summary(summary, column, with_plots=with_plots)

File ~/dev/inria/skrub/skrub/_reporting/_summarize.py:136, in _add_value_counts(summary, column, dataframe_summary, with_plots)
    134     summary[\"high_cardinality\"] = True
    135     return
--> 136 n_unique, value_counts = _utils.top_k_value_counts(column, k=10)
    137 # if the column contains all nulls, _add_value_counts does not get called
    138 assert n_unique > 0

File ~/dev/inria/skrub/skrub/_reporting/_utils.py:48, in top_k_value_counts(column, k)
     46 counts = sbd.sort(counts, by=\"count\", descending=True)
     47 counts = sbd.slice(counts, k)
---> 48 return n_unique, dict(zip(*to_dict(counts).values()))

TypeError: unhashable type: 'list'"

Vincent-Maladiere, Sep 10 '24 12:09

> Also, the TableReport returns an error when the dataframe contains a list:

Thanks for reporting it. This happens because the unique value counts are stored in a dict, and in this case the values (lists) are not hashable, so they cannot be used as keys. I'll open a separate issue for that.
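A minimal sketch of the failure mode, independent of skrub: building a dict whose keys are lists raises the same `TypeError`. The `hashable_key` helper below is hypothetical, shown only to illustrate one possible workaround (converting lists to tuples before counting); it is not necessarily how skrub handles this.

```python
from collections import Counter

# Column values similar to the reproducer above: each cell holds a list.
values = [[1], [1], [2]]

# Lists are unhashable, so using them as dict keys fails,
# just like dict(zip(...)) does in the traceback.
error_message = None
try:
    dict(zip(values, [2, 2, 1]))
except TypeError as exc:
    error_message = str(exc)


def hashable_key(value):
    """Hypothetical helper: recursively turn lists into tuples so they can be dict keys."""
    if isinstance(value, list):
        return tuple(hashable_key(v) for v in value)
    return value


# Counting on the converted keys works.
safe_counts = Counter(hashable_key(v) for v in values)
```

Another option would be to count on a string representation of each value, which also covers other unhashable types such as dicts, at the cost of losing the original objects.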

jeromedockes, Sep 10 '24 12:09

For the appearance, my vote goes to striped tables, shadows on cards, and no selected-cell outline. If there's a poll somewhere, I'm happy to vote there.

TheooJ, Sep 19 '24 19:09