feat: optimize table info() and describe() for tables with many columns
Is your feature request related to a problem?
Ibis tables provide `table.info()` and `table.describe()`. Both loop over each column, run several aggregations per column, and build the output by unioning the per-column statistics. Data scientists often use these functions for univariate analysis. When I tried them on a larger dataset (465 columns, 1.5M rows), I hit two issues:

- `table.info()` took forever to return results.
- `table.describe()` threw an Out of Memory exception.

`info()` mostly computes null counts and null fractions per column, yet it is very slow. Is that genuinely expensive to compute on DuckDB? The current implementation looks like this:
```python
# From ibis' Table.info(); `lit` refers to ibis.literal and `dt` to
# ibis.expr.datatypes. One aggregate subquery is built per column and
# the results are unioned at the end.
aggs = []
for pos, colname in enumerate(self.columns):
    col = self[colname]
    typ = col.type()
    agg = self.select(
        isna=ibis.case().when(col.isnull(), 1).else_(0).end()
    ).agg(
        name=lit(colname),
        type=lit(str(typ)),
        nullable=lit(typ.nullable),
        nulls=lambda t: t.isna.sum(),
        non_nulls=lambda t: (1 - t.isna).sum(),
        null_frac=lambda t: t.isna.mean(),
        pos=lit(pos, type=dt.int16),
    )
    aggs.append(agg)
return ibis.union(*aggs).order_by(ibis.asc("pos"))
```
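For comparison, the same null statistics can be expressed as a single aggregation pass, i.e. one SELECT with one aggregate expression per column instead of a 465-way union. This is only a minimal sketch of the idea, not anything in ibis today; `null_stats` is a hypothetical helper:

```python
import ibis


def null_stats(table: ibis.Table) -> dict[str, int]:
    # Minimal sketch (hypothetical helper): null count per column,
    # computed as one SELECT with N aggregate expressions instead of
    # N unioned aggregate subqueries.
    agg = table.aggregate(
        **{name: table[name].isnull().sum() for name in table.columns}
    )
    row = agg.execute().iloc[0]  # a single wide result row
    return {name: int(row[name]) for name in table.columns}
```

On DuckDB a single SELECT with many aggregates should plan as one table scan, so the per-column cost ought to be small next to the cost of the union.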
`describe()` throws an Out of Memory exception; it builds 465 small aggregate tables for the union, and I'm not sure why that consumes so much memory:
```
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/core.py:9 │
│ 9 in __rich_console__ │
│ │
│ 96 │ │ │
│ 97 │ │ try: │
│ 98 │ │ │ if opts.interactive: │
│ ❱ 99 │ │ │ │ rich_object = to_rich(self, console_width=console_width) │
│ 100 │ │ │ else: │
│ 101 │ │ │ │ rich_object = Text(self._noninteractive_repr()) │
│ 102 │ │ except Exception as e: │
│ │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/pretty.py │
│ :273 in to_rich │
│ │
│ 270 │ │ │ expr, max_length=max_length, max_string=max_string, max_depth=max_depth │
│ 271 │ │ ) │
│ 272 │ else: │
│ ❱ 273 │ │ return _to_rich_table( │
│ 274 │ │ │ expr, │
│ 275 │ │ │ max_rows=max_rows, │
│ 276 │ │ │ max_columns=max_columns, │
│ │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/pretty.py │
│ :345 in _to_rich_table │
│ │
│ 342 │ │ if orig_ncols > len(computed_cols): │
│ 343 │ │ │ table = table.select(*computed_cols) │
│ 344 │ │
│ ❱ 345 │ result = table.limit(max_rows + 1).to_pyarrow() │
│ 346 │ # Now format the columns in order, stopping if the console width would │
│ 347 │ # be exceeded. │
│ 348 │ col_info = [] │
│ │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/core.py:4 │
│ 86 in to_pyarrow │
│ │
│ 483 │ │ Table │
│ 484 │ │ │ A pyarrow table holding the results of the executed expression. │
│ 485 │ │ """ │
│ ❱ 486 │ │ return self._find_backend(use_default=True).to_pyarrow( │
│ 487 │ │ │ self, params=params, limit=limit, **kwargs │
│ 488 │ │ ) │
│ 489 │
│ │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/backends/duckdb/__in │
│ it__.py:1375 in to_pyarrow │
│ │
│ 1372 │ │ limit: int | str | None = None, │
│ 1373 │ │ **_: Any, │
│ 1374 │ ) -> pa.Table: │
│ ❱ 1375 │ │ table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow() │
│ 1376 │ │ return expr.__pyarrow_result__(table) │
│ 1377 │ │
│ 1378 │ def execute( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryException: Out of Memory Error: failed to offload data block of size 256.0 KiB (19.6 GiB/19.6 GiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'
```
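As a stop-gap (assuming enough disk space is available; this does not address the query plan itself), the setting mentioned in the error can be raised from ibis before calling `describe()`. The `'50GiB'` value below is only an example:

```python
import ibis

con = ibis.duckdb.connect()  # or however the table's backend was created
con.raw_sql("PRAGMA max_temp_directory_size='50GiB'")
```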
What is the motivation behind your request?
These two functions are very useful for univariate analysis.
Describe the solution you'd like
Could the per-column statistics be batched into a single aggregation pass, or computed in parallel, instead of one aggregate subquery per column? A rough sketch of the batching idea follows.
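To make the suggestion concrete, here is a hedged sketch of a batched `describe()`: all statistics are computed in one aggregation with name-mangled keys, then unpivoted client-side. `batched_describe` and the `__` separator are hypothetical, and the separator would need care for column names that already contain `__`:

```python
import ibis


def batched_describe(table: ibis.Table) -> list[dict]:
    # Build every per-column statistic as one aggregation pass.
    exprs = {}
    for name in table.columns:
        col = table[name]
        exprs[f"{name}__count"] = col.count()         # non-null count
        exprs[f"{name}__nulls"] = col.isnull().sum()  # null count
        if col.type().is_numeric():
            exprs[f"{name}__mean"] = col.mean()
            exprs[f"{name}__min"] = col.min()
            exprs[f"{name}__max"] = col.max()
    wide = table.aggregate(**exprs).execute()  # a single wide row

    # Unpivot the wide row into one record per (column, statistic).
    records = []
    for key, value in wide.iloc[0].items():
        name, stat = key.rsplit("__", 1)
        records.append({"name": name, "stat": stat, "value": value})
    return records
```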
What version of ibis are you running?
9.1.0
What backend(s) are you using, if any?
DuckDB
Code of Conduct
- [X] I agree to follow this project's Code of Conduct