ibis icon indicating copy to clipboard operation
ibis copied to clipboard

fix: manually casting the type of pos to Int16 for table.info()

Open jitingxu1 opened this issue 9 months ago • 2 comments

Description of changes

when I was working on a data science project, it has more than 128 columns in the table, when I run table.info() it throws the error

RelationError                             Traceback (most recent call last)
Cell In[95], line 1
----> 1 static[cols].info()

File /opt/conda/lib/python3.10/site-packages/ibis/expr/types/relations.py:2829, in Table.info(self)
   2817     agg = self.select(
   2818         isna=ibis.case().when(col.isnull(), 1).else_(0).end()
   2819     ).agg(
   (...)
   2826         pos=lit(pos),
   2827     )
   2828     aggs.append(agg)
-> 2829 return ibis.union(*aggs).order_by(ibis.asc("pos"))

File /opt/conda/lib/python3.10/site-packages/ibis/expr/api.py:1948, in union(table, distinct, *rest)
   1882 def union(table: ir.Table, *rest: ir.Table, distinct: bool = False) -> ir.Table:
   1883     """Compute the set union of multiple table expressions.
   1884 
   1885     The input tables must have identical schemas.
   (...)
   1946 
   1947     """
-> 1948     return table.union(*rest, distinct=distinct) if rest else table

File /opt/conda/lib/python3.10/site-packages/ibis/expr/types/relations.py:1751, in Table.union(self, table, distinct, *rest)
   1749 node = ops.Union(self, table, distinct=distinct)
   1750 for table in rest:
-> 1751     node = ops.Union(node, table, distinct=distinct)
   1752 return node.to_expr().select(self.columns)

File /opt/conda/lib/python3.10/site-packages/ibis/common/bases.py:72, in AbstractMeta.__call__(cls, *args, **kwargs)
     52 def __call__(cls, *args, **kwargs):
     53     """Create a new instance of the class.
     54 
     55     The subclass may override the `__create__` classmethod to change the
   (...)
     70 
     71     """
---> 72     return cls.__create__(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/ibis/common/grounds.py:120, in Annotable.__create__(cls, *args, **kwargs)
    116 @classmethod
    117 def __create__(cls, *args: Any, **kwargs: Any) -> Self:
    118     # construct the instance by passing only validated keyword arguments
    119     kwargs = cls.__signature__.validate(cls, args, kwargs)
--> 120     return super().__create__(**kwargs)

File /opt/conda/lib/python3.10/site-packages/ibis/expr/operations/relations.py:293, in Set.__init__(self, left, right, **kwargs)
    290 def __init__(self, left, right, **kwargs):
    291     # convert to dictionary first, to get key-unordered comparison semantics
    292     if dict(left.schema) != dict(right.schema):
--> 293         raise RelationError("Table schemas must be equal for set operations")
    294     elif left.schema.names != right.schema.names:
    295         # rewrite so that both sides have the columns in the same order making it
    296         # easier for the backends to implement set operations
    297         cols = {name: Field(right, name) for name in left.schema.names}

RelationError: Table schemas must be equal for set operations

Here is the code could be used to reproduce the error,

 import ibis
   ...: ibis.options.interactive = True
   ...: df = {
   ...:     f"col_{i}": [1, 0] for i in range(129)
   ...: }
   ...: print(len(df.keys()))
   ...:  d = ibis.memtable(df)
   ...: d.info()

The schemas when union are different, The type of pos is int8 for the first 128 columns, it will be int16 after that. I cast all into int16 which is large enough for lots of columns.

dict(left.schema) = {
'name': String(nullable=True), 
'type': String(nullable=True), 
'nullable': Boolean(nullable=True),
 'nulls': Int64(nullable=True), 
'non_nulls': Int64(nullable=True), 
'null_frac': Float64(nullable=True), 
'pos': Int8(nullable=True)
}

dict(right.schema) = {
'name': String(nullable=True), 
'type': String(nullable=True), 
'nullable': Boolean(nullable=True), 
'nulls': Int64(nullable=True), 
'non_nulls': Int64(nullable=True), 
'null_frac': Float64(nullable=True), 
###### int16
'pos': Int16(nullable=True)
}

Issues closed

jitingxu1 avatar May 07 '24 01:05 jitingxu1

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

github-actions[bot] avatar May 07 '24 01:05 github-actions[bot]

This PR needs a test. Something like:

  1. Create a table with > 127 columns
  2. Call the info() method
  3. Make some kind of assertion

tried to add test like this, the code could work when num_cols is small.

@pytest.mark.notyet(
    ["druid"],
    raises=PyDruidProgrammingError,
    reason="Druid only supports trivial unions",
)
def test_table_info_wider(con, backend):
    num_cols = 129
    col_names = [f"col_{i}"for i in range(num_cols)]
    t = ibis.memtable(
        {col: [0, 1] for col in col_names}
    )
    expr = t.info()
    result = expr.execute()
    assert list(result.name) == col_names

but got the RecursionError:when I have 129 columns, not sure if I create the table correctly, I did tried con.create_table() and ibis.memtable() Do you have an idea on this?

Here is the log

        expr = t.info()
>       result = expr.execute()

ibis/backends/tests/test_generic.py:633: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ibis/expr/types/core.py:393: in execute
    return self._find_backend(use_default=True).execute(
ibis/backends/duckdb/__init__.py:1355: in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
ibis/backends/duckdb/__init__.py:1288: in _to_duckdb_relation
    sql = self.compile(table_expr, limit=limit, params=params)
ibis/backends/sql/__init__.py:178: in compile
    sql = query.sql(dialect=self.dialect, pretty=pretty, copy=False)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/expressions.py:561: in sql
    return Dialect.get_or_raise(dialect).generate(self, **opts)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/dialects/dialect.py:498: in generate
    return self.generator(**opts).generate(expression, copy=copy)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:583: in generate
    sql = self.sql(expression).strip()
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:739: in sql
    sql = getattr(self, exp_handler_name)(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:2230: in select_sql
    self.sql(expression, "from", comment=False),
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:739: in sql
    sql = getattr(self, exp_handler_name)(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:2316: in union_sql
    return self.set_operations(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:2309: in set_operations
    sqls.append(self.sql(node))
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:739: in sql
    sql = getattr(self, exp_handler_name)(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:2230: in select_sql
    self.sql(expression, "from", comment=False),
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:728: in sql
    return self.sql(value)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:739: in sql
    sql = getattr(self, exp_handler_name)(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:1800: in from_sql
    return f"{self.seg('FROM')} {self.sql(expression, 'this')}"
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:728: in sql
    return self.sql(value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sqlglot.dialects.duckdb.DuckDB.Generator object at 0x154de9910>
expression = Subquery(
  this=Union(
    this=Select(
      expressions=[
        Star()],
      from=From(
        this=Subquery(
...          alias=Identifier(this=t203, quoted=True)))),
    distinct=False),
  alias=Identifier(this=t313, quoted=True))
key = None, comment = True

    def sql(
        self,
        expression: t.Optional[str | exp.Expression],
        key: t.Optional[str] = None,
        comment: bool = True,
    ) -> str:
        if not expression:
            return ""
    
        if isinstance(expression, str):
            return expression
    
        if key:
            value = expression.args.get(key)
            if value:
                return self.sql(value)
            return ""
    
        transform = self.TRANSFORMS.get(expression.__class__)
    
>       if callable(transform):
E       RecursionError: maximum recursion depth exceeded while calling a Python object

/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:733: RecursionError
!!! Recursion error detected, but an error occurred locating the origin of recursion.
  The following exception happened when comparing locals in the stack frame:
    RecursionError: maximum recursion depth exceeded
  Displaying first and last 10 stack frames out of 964.
========================================================================================================== short test summary info ===========================================================================================================
FAILED ibis/backends/tests/test_generic.py::test_table_info_wider[duckdb] - RecursionError: maximum recursion depth exceeded while calling a Python object

jitingxu1 avatar May 09 '24 00:05 jitingxu1

I was waiting for the #9194

Now added the unit test, ping @cpcloud for review.

jitingxu1 avatar May 31 '24 21:05 jitingxu1

Seems like flink can't handle the new test. Can you mark it as notyet or broken?

cpcloud avatar Jun 01 '24 13:06 cpcloud

Tests are still failing for Flink.

Could we re run the test? The logs:

Caused by: java.io.IOException: Insufficient number of network buffers: required 512, but only 348 available. The total number of network buffers is currently set to 20316 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max'.

jitingxu1 avatar Jun 03 '24 17:06 jitingxu1

Not really, because it's a repeatable failure.

It's almost certainly related to the "large" number of columns (not actually that large).

Can you instead xfail the test for now?

cpcloud avatar Jun 03 '24 17:06 cpcloud