Postgres bug in creating dataframe with field types of date
What happened?
I'm querying a table with a schema like this
ibis.Schema {
a !int64
b !timestamp('UTC')
c !date
d string
e string
f !string
g !decimal(6, 2)
h !int64
i string
}
I then do something like
connection.postgres.tables.my_table.execute()
And I get the following error
ValueError: Unexpected value for 'dtype': 'datetime64[D]'. Must be 'datetime64[s]', 'datetime64[ms]', 'datetime64[us]', 'datetime64[ns]' or DatetimeTZDtype'.
As always, happy to provide the full query in all its glory via DM on Zulip
What version of ibis are you using?
'8.0.0'
What backend(s) are you using, if any?
postgres
Relevant log output
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/expr/types/core.py:324, in Expr.execute(self, limit, timecontext, params, **kwargs)
297 def execute(
298 self,
299 limit: int | str | None = "default",
(...)
302 **kwargs: Any,
303 ):
304 """Execute an expression against its backend if one exists.
305
306 Parameters
(...)
322 Keyword arguments
323 """
--> 324 return self._find_backend(use_default=True).execute(
325 self, limit=limit, timecontext=timecontext, params=params, **kwargs
326 )
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/backends/base/sql/__init__.py:343, in BaseSQLBackend.execute(self, expr, params, limit, **kwargs)
340 schema = expr.as_table().schema()
342 with self._safe_raw_sql(sql, **kwargs) as cursor:
--> 343 result = self.fetch_from_cursor(cursor, schema)
345 return expr.__pandas_result__(result)
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/__init__.py:243, in BaseAlchemyBackend.fetch_from_cursor(self, cursor, schema)
241 cursor.close()
242 raise
--> 243 df = PandasData.convert_table(df, schema)
244 if not df.empty and geospatial_supported:
245 return self._to_geodataframe(df, schema)
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/formats/pandas.py:118, in PandasData.convert_table(cls, df, schema)
113 raise ValueError(
114 "schema column count does not match input data column count"
115 )
117 for (name, series), dtype in zip(df.items(), schema.types):
--> 118 df[name] = cls.convert_column(series, dtype)
120 # return data with the schema's columns which may be different than the
121 # input columns
122 df.columns = schema.names
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/formats/pandas.py:135, in PandasData.convert_column(cls, obj, dtype)
132 method_name = f"convert_{dtype.__class__.__name__}"
133 convert_method = getattr(cls, method_name, cls.convert_default)
--> 135 result = convert_method(obj, dtype, pandas_type)
136 assert not isinstance(result, np.ndarray), f"{convert_method} -> {type(result)}"
137 return result
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/formats/pandas.py:201, in PandasData.convert_Date(cls, s, dtype, pandas_type)
199 s = s.dt.tz_convert("UTC").dt.tz_localize(None)
200 try:
--> 201 return s.astype(pandas_type).dt.date
202 except (TypeError, pd._libs.tslibs.OutOfBoundsDatetime):
204 def try_date(v):
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/generic.py:6637, in NDFrame.astype(self, dtype, copy, errors)
6631 results = [
6632 ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
6633 ]
6635 else:
6636 # else, only a single dtype is given
-> 6637 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
6638 res = self._constructor_from_mgr(new_data, axes=new_data.axes)
6639 return res.__finalize__(self, method="astype")
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/internals/managers.py:431, in BaseBlockManager.astype(self, dtype, copy, errors)
428 elif using_copy_on_write():
429 copy = False
--> 431 return self.apply(
432 "astype",
433 dtype=dtype,
434 copy=copy,
435 errors=errors,
436 using_cow=using_copy_on_write(),
437 )
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/internals/managers.py:364, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
362 applied = b.apply(f, **kwargs)
363 else:
--> 364 applied = getattr(b, f)(**kwargs)
365 result_blocks = extend_blocks(applied, result_blocks)
367 out = type(self).from_blocks(result_blocks, self.axes)
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
755 raise ValueError("Can not squeeze with more than one column.")
756 values = values[0, :] # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
760 new_values = maybe_coerce_values(new_values)
762 refs = None
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
234 dtype = dtype.numpy_dtype
236 try:
--> 237 new_values = astype_array(values, dtype, copy=copy)
238 except (ValueError, TypeError):
239 # e.g. _astype_nansafe can fail on object-dtype of strings
240 # trying to convert to float
241 if errors == "ignore":
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:182, in astype_array(values, dtype, copy)
179 values = values.astype(dtype, copy=copy)
181 else:
--> 182 values = _astype_nansafe(values, dtype, copy=copy)
184 # in pandas we don't store numpy str dtypes, so convert to object
185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:110, in _astype_nansafe(arr, dtype, copy, skipna)
107 if lib.is_np_dtype(dtype, "M"):
108 from pandas.core.arrays import DatetimeArray
--> 110 dta = DatetimeArray._from_sequence(arr, dtype=dtype)
111 return dta._ndarray
113 elif lib.is_np_dtype(dtype, "m"):
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py:327, in DatetimeArray._from_sequence(cls, scalars, dtype, copy)
325 @classmethod
326 def _from_sequence(cls, scalars, *, dtype=None, copy: bool = False):
--> 327 return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py:354, in DatetimeArray._from_sequence_not_strict(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
351 else:
352 tz = timezones.maybe_get_tz(tz)
--> 354 dtype = _validate_dt64_dtype(dtype)
355 # if dtype has an embedded tz, capture it
356 tz = _validate_tz_from_dtype(dtype, tz, explicit_tz_none)
File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py:2550, in _validate_dt64_dtype(dtype)
2544 raise ValueError(msg)
2546 if (
2547 isinstance(dtype, np.dtype)
2548 and (dtype.kind != "M" or not is_supported_dtype(dtype))
2549 ) or not isinstance(dtype, (np.dtype, DatetimeTZDtype)):
-> 2550 raise ValueError(
2551 f"Unexpected value for 'dtype': '{dtype}'. "
2552 "Must be 'datetime64[s]', 'datetime64[ms]', 'datetime64[us]', "
2553 "'datetime64[ns]' or DatetimeTZDtype'."
2554 )
2556 if getattr(dtype, "tz", None):
2557 # https://github.com/pandas-dev/pandas/issues/18595
2558 # Ensure that we have a standard timezone for pytz objects.
2559 # Without this, things like adding an array of timedeltas and
2560 # a tz-aware Timestamp (with a tz specific to its datetime) will
2561 # be incorrect(ish?) for the array as a whole
2562 dtype = cast(DatetimeTZDtype, dtype)
ValueError: Unexpected value for 'dtype': 'datetime64[D]'. Must be 'datetime64[s]', 'datetime64[ms]', 'datetime64[us]', 'datetime64[ns]' or DatetimeTZDtype'.
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
This is likely a pandas version issue. Can you try downgrading to pandas 2.1?
We don't yet support pandas 2.2 because it contains some changes that require us to yet again handle date conversions and we haven't yet worked out all the details.
You are correct. This problem goes away with a pandas downgrade. I was also able to circumvent the problem and remain on pandas 2.2 by explicitly casting those fields to datetime.
I'm good for now, so feel free to close/archive this bug if you'd like.
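For context on why the explicit cast sidesteps the error: pandas 2.x only accepts 's', 'ms', 'us', and 'ns' as datetime64 units, so any code path that ends up calling astype("datetime64[D]") on an object column of dates now raises the ValueError above. A minimal pandas-only sketch (not the actual ibis code path) of a conversion that does work:

```python
import datetime as dt

import pandas as pd

# Object-dtype column of Python date objects, roughly what the postgres
# driver hands back for a DATE column before ibis converts it.
s = pd.Series([dt.date(2024, 1, 1), dt.date(2024, 2, 29), None], dtype=object)

# "datetime64[D]" is not an accepted unit in pandas 2.x, so go through
# to_datetime (which yields a supported datetime64[ns] series) and then
# take .dt.date to get date objects back out.
converted = pd.to_datetime(s).dt.date
print(converted.tolist())
```

The same idea applied at the ibis level is just casting the date columns to timestamp before execute, as described above.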
I'll leave it open until we can upgrade our pandas support.
There's also a larger issue here with pandas versions: how we should approach the many, many ways there are to hold data in a pandas DataFrame or Series:
- pyarrow
- pandas nullable types
- numpy types
- any combination of the three
@cpcloud I haven't played much with pandas outside the numpy types, so I'm not sure how everything fits together. One thing to keep in mind is that if users have spent time adding type-checking to their dataframes (yours truly), then changing the underlying types could create a substantial maintenance burden on existing code bases. This is not a reason to avoid making the best choices for the future of ibis, but it may be a reason to launch those changes from a really stable version of ibis that people can fall back to while upgrading their code bases.
Also -- probably more debt than you all want to hold -- but how bad would it be to have global ibis configuration for underlying pandas type system? Or maybe an execute() kwarg?
but it may be a reason to launch those changes from a really stable version of ibis that people can fall back to while upgrading their code bases.
Yeah, that's a good point. We need to do a bit of planning here with input from the community to understand what might be the best path forward.
how bad would it be to have global ibis configuration for underlying pandas type system
Pretty bad 😅
There are at least three different pandas type systems if you consider each being used exclusively, but there are actually more: you can mix the type systems however you like. Maybe this is considered a feature, but it's a massive burden for downstream consumers of pandas to take on.
One problem is gaps in type system support. What should Ibis do in the case of nested types? I think folks are working on proper nested type support via pyarrow; without that support, Ibis and other libraries are left to their own devices, probably falling back to object dtypes. This makes integration with other libraries challenging.
So far the most promising idea is that all execution first goes through pyarrow, and then whatever comes out of pa.Table.to_pandas() on that result is what we would return from execute. Perhaps we could pass through the pyarrow-to-pandas type mapping, to let people configure the return types if they absolutely must.
Yikes! That sounds like a nightmare! Well, it sounds like you all are already well on top of this, but putting thought into the maintenance burden of your downstream users is always appreciated.
Fixed by #8758.