feat: optimize table info() and describe() for tables with many columns
Is your feature request related to a problem?
Ibis tables provide `table.info()` and `table.describe()`. Both loop over each column, run several aggregations per column, and build the output by unioning the per-column statistics. Data scientists often use these functions for univariate analysis. When I tried them on a larger dataset (465 columns, 1.5M rows), I hit two issues:

- `table.info()` took forever to return results.
- `table.describe()` threw an Out of Memory exception.

`info()` mostly computes null counts and null fractions per column, yet it is very slow. Is that genuinely expensive to compute on DuckDB? The current implementation looks like this:
```python
# From ibis' Table.info(); `lit` refers to ibis.literal and `dt` to
# ibis.expr.datatypes. One aggregate subquery is built per column and
# the results are unioned at the end.
aggs = []
for pos, colname in enumerate(self.columns):
    col = self[colname]
    typ = col.type()
    agg = self.select(
        isna=ibis.case().when(col.isnull(), 1).else_(0).end()
    ).agg(
        name=lit(colname),
        type=lit(str(typ)),
        nullable=lit(typ.nullable),
        nulls=lambda t: t.isna.sum(),
        non_nulls=lambda t: (1 - t.isna).sum(),
        null_frac=lambda t: t.isna.mean(),
        pos=lit(pos, type=dt.int16),
    )
    aggs.append(agg)
return ibis.union(*aggs).order_by(ibis.asc("pos"))
```
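For comparison, the same null statistics can be expressed as a single aggregation pass, i.e. one SELECT with one aggregate expression per column instead of a 465-way union. This is only a minimal sketch of the idea, not anything in ibis today; `null_stats` is a hypothetical helper:

```python
import ibis


def null_stats(table: ibis.Table) -> dict[str, int]:
    # Minimal sketch (hypothetical helper): null count per column,
    # computed as one SELECT with N aggregate expressions instead of
    # N unioned aggregate subqueries.
    agg = table.aggregate(
        **{name: table[name].isnull().sum() for name in table.columns}
    )
    row = agg.execute().iloc[0]  # a single wide result row
    return {name: int(row[name]) for name in table.columns}
```

On DuckDB a single SELECT with many aggregates should plan as one table scan, so the per-column cost ought to be small next to the cost of the union.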
`describe()` throws an Out of Memory exception; it builds 465 small aggregate tables for the union, and I'm not sure why that consumes so much memory:
```
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/core.py:9 │
│ 9 in __rich_console__ │
│ │
│ 96 │ │ │
│ 97 │ │ try: │
│ 98 │ │ │ if opts.interactive: │
│ ❱ 99 │ │ │ │ rich_object = to_rich(self, console_width=console_width) │
│ 100 │ │ │ else: │
│ 101 │ │ │ │ rich_object = Text(self._noninteractive_repr()) │
│ 102 │ │ except Exception as e: │
│ │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/pretty.py │
│ :273 in to_rich │
│ │
│ 270 │ │ │ expr, max_length=max_length, max_string=max_string, max_depth=max_depth │
│ 271 │ │ ) │
│ 272 │ else: │
│ ❱ 273 │ │ return _to_rich_table( │
│ 274 │ │ │ expr, │
│ 275 │ │ │ max_rows=max_rows, │
│ 276 │ │ │ max_columns=max_columns, │
│ │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/pretty.py │
│ :345 in _to_rich_table │
│ │
│ 342 │ │ if orig_ncols > len(computed_cols): │
│ 343 │ │ │ table = table.select(*computed_cols) │
│ 344 │ │
│ ❱ 345 │ result = table.limit(max_rows + 1).to_pyarrow() │
│ 346 │ # Now format the columns in order, stopping if the console width would │
│ 347 │ # be exceeded. │
│ 348 │ col_info = [] │
│ │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/core.py:4 │
│ 86 in to_pyarrow │
│ │
│ 483 │ │ Table │
│ 484 │ │ │ A pyarrow table holding the results of the executed expression. │
│ 485 │ │ """ │
│ ❱ 486 │ │ return self._find_backend(use_default=True).to_pyarrow( │
│ 487 │ │ │ self, params=params, limit=limit, **kwargs │
│ 488 │ │ ) │
│ 489 │
│ │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/backends/duckdb/__in │
│ it__.py:1375 in to_pyarrow │
│ │
│ 1372 │ │ limit: int | str | None = None, │
│ 1373 │ │ **_: Any, │
│ 1374 │ ) -> pa.Table: │
│ ❱ 1375 │ │ table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow() │
│ 1376 │ │ return expr.__pyarrow_result__(table) │
│ 1377 │ │
│ 1378 │ def execute( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryException: Out of Memory Error: failed to offload data block of size 256.0 KiB (19.6 GiB/19.6 GiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'
```
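As a stop-gap (assuming enough disk space is available; this does not address the query plan itself), the setting mentioned in the error can be raised from ibis before calling `describe()`. The `'50GiB'` value below is only an example:

```python
import ibis

con = ibis.duckdb.connect()  # or however the table's backend was created
con.raw_sql("PRAGMA max_temp_directory_size='50GiB'")
```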
What is the motivation behind your request?
These two functions are very useful for univariate analysis.
Describe the solution you'd like
Could the per-column statistics be batched into a single aggregation pass, or computed in parallel, instead of one aggregate subquery per column? A rough sketch of the batching idea follows.
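To make the suggestion concrete, here is a hedged sketch of a batched `describe()`: all statistics are computed in one aggregation with name-mangled keys, then unpivoted client-side. `batched_describe` and the `__` separator are hypothetical, and the separator would need care for column names that already contain `__`:

```python
import ibis


def batched_describe(table: ibis.Table) -> list[dict]:
    # Build every per-column statistic as one aggregation pass.
    exprs = {}
    for name in table.columns:
        col = table[name]
        exprs[f"{name}__count"] = col.count()         # non-null count
        exprs[f"{name}__nulls"] = col.isnull().sum()  # null count
        if col.type().is_numeric():
            exprs[f"{name}__mean"] = col.mean()
            exprs[f"{name}__min"] = col.min()
            exprs[f"{name}__max"] = col.max()
    wide = table.aggregate(**exprs).execute()  # a single wide row

    # Unpivot the wide row into one record per (column, statistic).
    records = []
    for key, value in wide.iloc[0].items():
        name, stat = key.rsplit("__", 1)
        records.append({"name": name, "stat": stat, "value": value})
    return records
```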
What version of ibis are you running?
9.1.0
What backend(s) are you using, if any?
DuckDB
Code of Conduct
- [X] I agree to follow this project's Code of Conduct