feat: Make string columns be dtype "string", not object, after execute()
Is your feature request related to a problem?
If a table has any string columns, those columns in `table.execute()` are of dtype `object`, not `"string"`.
Describe the solution you'd like
They should be dtype "string". This would be breaking for some people I'm sure. But we need to move to the newer pandas dtypes.
What version of ibis are you running?
4.0.0
What backend(s) are you using, if any?
duckdb
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Well, we've got 5.0.0 coming up, seems like as good a time as any to make this change!
@cpcloud relatedly, int64 columns with nulls are turned into floats when executed into pandas DataFrames. Really those should be `pd.Int64Dtype()`, and similarly for all the other nullable int dtypes.
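Concretely, a minimal illustration of the upcast with plain pandas:

```python
import pandas as pd

# With the default NumPy-backed dtypes, introducing a null forces
# int64 -> float64 (the None becomes NaN).
s = pd.Series([1, 2, None])
print(s.dtype)  # float64

# The nullable extension dtype keeps the integers and represents
# the missing value as pd.NA instead.
s_nullable = pd.Series([1, 2, None], dtype="Int64")
print(s_nullable.dtype)  # Int64
```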
Do you have any idea how widespread the use of these types is?
TBH I haven't followed whether these nullable types are now the common case or not.
Nullables are not the default in pandas yet. 2.0 is adding an option for all the I/O routines (configurable with a local or global option) to return nullables as well. With 2.0 coming out soon, I would certainly consider defaulting to these types for pandas 2.0 installs (so probably wait / have an option).
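For reference, in the pandas 2.0 API this surfaced as the `dtype_backend` keyword on the I/O routines. A small sketch (requires pandas >= 2.0):

```python
import io
import pandas as pd

# A CSV with a missing value in an integer column.
csv = io.StringIO("a,b\n1,x\n,y\n")

# dtype_backend="numpy_nullable" asks read_csv to return nullable
# extension dtypes instead of the NumPy defaults.
df = pd.read_csv(csv, dtype_backend="numpy_nullable")
print(df.dtypes)  # a -> Int64, b -> string
```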
Thanks @jreback!
I think we can do the following:
- Add experimental support for this in 5.0, with an option `use_nullable_dtypes` (?) defaulting to `False`
- If all goes well, switch that option to `True` for 6.0
@NickCrews @jreback Thoughts?
sgtm
Yeah sgtm.
One tricky case: consider an int64 column. If it has nulls then it should be turned into a `pd.Int64Dtype()`, but what if there are no nulls? Should it be a simple np.int64, or still get upcast to the nullable type? Perhaps the flag shouldn't just be boolean, but `"never"`/`"as_needed"`/`"always"`?
IDK if pandas already has an API for choosing how to do this conversion, but if we can mirror/leverage that API then it would reduce the impedance mismatch.
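One existing pandas hook for this kind of conversion is `DataFrame.convert_dtypes()`, which infers the best nullable extension dtype for each column (a sketch; whether ibis can leverage it internally is a separate question):

```python
import pandas as pd

df = pd.DataFrame({
    "ints": pd.Series([1, 2, None]),      # float64 after the null upcast
    "strs": pd.Series(["a", "b", None]),  # object
})

# convert_dtypes() re-infers each column as a nullable extension dtype.
converted = df.convert_dtypes()
print(converted.dtypes)  # ints -> Int64, strs -> string
```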
The issue with that kind of optionality is that `has_nulls()` will introduce yet another scan of the data.
I think this is going to be an all-or-nothing option, and not that smart of an implementation (but correct!)
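That per-column scan would look something like this (a hypothetical sketch of an `"as_needed"` policy, not ibis's actual implementation):

```python
import pandas as pd

s = pd.Series([1, 2, None])

# Deciding "as needed" means a full O(n) pass over the column
# just to learn whether any value is missing.
needs_nullable = s.isna().any()
dtype = "Int64" if needs_nullable else "int64"
print(dtype)  # Int64, since this column contains a null
```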
Cool, always going to nullable works for me
I can make a new issue for this if you'd like, but I think I ran across another flavor of this today. Is this the behavior we want, or should we make more assumptions when casting `object` columns in pandas to `string` in ibis?
```python
import ibis
import pandas as pd

df = pd.DataFrame(
    {
        "string": pd.Series(["a", "b", "c"], dtype="string"),
        "str": pd.Series(["a", "b", "c"], dtype="object"),
        "all_null": pd.Series([None, None, None], dtype="object"),
        "cat": pd.Series(["a", "b", "c"], dtype="category"),
    }
)

# works
ibis.memtable(df.drop(columns="all_null")).execute()

# IbisTypeError: DuckDB cannot yet reliably handle `null` typed columns; got null typed columns: ['all_null']
ibis.memtable(df).execute()

# IbisTypeError: DuckDB cannot yet reliably handle `null` typed columns; got null typed columns: ['str', 'all_null']
ibis.memtable(df.iloc[:0]).execute()
```