ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: Make string columns be dtype "string", not object, after execute()

Open NickCrews opened this issue 2 years ago • 10 comments

Is your feature request related to a problem?

If a table has any string columns, those columns in table.execute() are of dtype object, not "string".

Describe the solution you'd like

They should be dtype "string". This would be breaking for some people I'm sure. But we need to move to the newer pandas dtypes.

What version of ibis are you running?

4.0.0

What backend(s) are you using, if any?

duckdb

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

NickCrews avatar Jan 24 '23 22:01 NickCrews

Well, we've got 5.0.0 coming up, seems like as good a time as any to make this change!

cpcloud avatar Jan 24 '23 22:01 cpcloud

@cpcloud relatedly, int64 columns with nulls are turned into floats when executed into pandas dataframes. Really those should be pd.Int64(). And similar for all other nullable int dtypes.

NickCrews avatar Jan 27 '23 02:01 NickCrews

Do you have any idea how widespread the use of these types is?

TBH I haven't followed whether these nullable types are now the common case or not.

cpcloud avatar Jan 27 '23 02:01 cpcloud

nullables are not the default in pandas yet

2.0 is adding an option for all the I/O routines (configurable with a local or global option) to return nullables as well

we have 2.0 coming out soon i would certainly consider defaulting these types for 2.0 pandas installs (so prob wait / have an option)

jreback avatar Jan 27 '23 02:01 jreback

Thanks @jreback!

I think we can do the following:

  1. Add experimental support for this in 5.0, with an option use_nullable_dtypes (?) defaulting to False
  2. If all goes well, switch that option to True for 6.0

@NickCrews @jreback Thoughts?

cpcloud avatar Jan 27 '23 12:01 cpcloud

sgtm

jreback avatar Jan 27 '23 12:01 jreback

Yeah sgtm.

One tricky case: consider an int64 column: if it has nulls then it should be turned into an pd.Int64(), but what if there are no nulls? Should it be a simple np.int64, or still get upcast to the nullable type? Perhaps the flag shouldn't just be boolean, but "never"/"as_needed"/"always"?

IDK if pandas already has an API for choosing how to do this conversion, but if we can mirror/leverage that API then it would reduce the impedance mismatch.

NickCrews avatar Jan 27 '23 17:01 NickCrews

The issue with that kind of optionality is that has_nulls() will introduce yet another scan of the data.

I think this is going to be an all or nothing option, and not that smart of an implementation (but correct!)

cpcloud avatar Jan 27 '23 18:01 cpcloud

Cool, always going to nullable works for me

NickCrews avatar Jan 27 '23 18:01 NickCrews

I can make a new issue for this if you ask, but I think I ran across another flavor of this today. Is this what we want the behavior to be, or should we make more assumptions about casting "object" in pandas to "string" in ibis?

import ibis
import pandas as pd

df = pd.DataFrame(
    {
        "string": pd.Series(["a", "b", "c"], dtype="string"),
        "str": pd.Series(["a", "b", "c"], dtype="object"),
        "all_null": pd.Series([None, None, None], dtype="object"),
        "cat": pd.Series(["a", "b", "c"], dtype="category"),
    }
)
# works
ibis.memtable(df.drop(columns="all_null")).execute()
# IbisTypeError: DuckDB cannot yet reliably handle `null` typed columns; got null typed columns: ['all_null']
ibis.memtable(df).execute()
# IbisTypeError: DuckDB cannot yet reliably handle `null` typed columns; got null typed columns: ['str', 'all_null']
ibis.memtable(df.iloc[:0]).execute()

NickCrews avatar Jan 22 '24 20:01 NickCrews