ibis icon indicating copy to clipboard operation
ibis copied to clipboard

Postgres bug in creating dataframe with field types of date

Open robdmc opened this issue 1 year ago • 6 comments

What happened?

I'm querying a table with a schema like this

 ibis.Schema {
  a   !int64
  b   !timestamp('UTC')
  c   !date
  d    string
  e    string
  f    !string
  g   !decimal(6, 2)
  h   !int64
  i     string
}

I then do something like

connection.postgres.tables.my_table.execute()

And I get the following error

ValueError: Unexpected value for 'dtype': 'datetime64[D]'. Must be 'datetime64[s]', 'datetime64[ms]', 'datetime64[us]', 'datetime64[ns]' or DatetimeTZDtype'.

As always, happy to provide full query in all it's glory via DM on Zulip

What version of ibis are you using?

'8.0.0'

What backend(s) are you using, if any?

postgres

Relevant log output

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/expr/types/core.py:324, in Expr.execute(self, limit, timecontext, params, **kwargs)
    297 def execute(
    298     self,
    299     limit: int | str | None = "default",
   (...)
    302     **kwargs: Any,
    303 ):
    304     """Execute an expression against its backend if one exists.
    305 
    306     Parameters
   (...)
    322         Keyword arguments
    323     """
--> 324     return self._find_backend(use_default=True).execute(
    325         self, limit=limit, timecontext=timecontext, params=params, **kwargs
    326     )

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/backends/base/sql/__init__.py:343, in BaseSQLBackend.execute(self, expr, params, limit, **kwargs)
    340 schema = expr.as_table().schema()
    342 with self._safe_raw_sql(sql, **kwargs) as cursor:
--> 343     result = self.fetch_from_cursor(cursor, schema)
    345 return expr.__pandas_result__(result)

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/__init__.py:243, in BaseAlchemyBackend.fetch_from_cursor(self, cursor, schema)
    241     cursor.close()
    242     raise
--> 243 df = PandasData.convert_table(df, schema)
    244 if not df.empty and geospatial_supported:
    245     return self._to_geodataframe(df, schema)

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/formats/pandas.py:118, in PandasData.convert_table(cls, df, schema)
    113     raise ValueError(
    114         "schema column count does not match input data column count"
    115     )
    117 for (name, series), dtype in zip(df.items(), schema.types):
--> 118     df[name] = cls.convert_column(series, dtype)
    120 # return data with the schema's columns which may be different than the
    121 # input columns
    122 df.columns = schema.names

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/formats/pandas.py:135, in PandasData.convert_column(cls, obj, dtype)
    132 method_name = f"convert_{dtype.__class__.__name__}"
    133 convert_method = getattr(cls, method_name, cls.convert_default)
--> 135 result = convert_method(obj, dtype, pandas_type)
    136 assert not isinstance(result, np.ndarray), f"{convert_method} -> {type(result)}"
    137 return result

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/ibis/formats/pandas.py:201, in PandasData.convert_Date(cls, s, dtype, pandas_type)
    199     s = s.dt.tz_convert("UTC").dt.tz_localize(None)
    200 try:
--> 201     return s.astype(pandas_type).dt.date
    202 except (TypeError, pd._libs.tslibs.OutOfBoundsDatetime):
    204     def try_date(v):

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/generic.py:6637, in NDFrame.astype(self, dtype, copy, errors)
   6631     results = [
   6632         ser.astype(dtype, copy=copy, errors=errors) for _, ser in self.items()
   6633     ]
   6635 else:
   6636     # else, only a single dtype is given
-> 6637     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6638     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6639     return res.__finalize__(self, method="astype")

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/internals/managers.py:431, in BaseBlockManager.astype(self, dtype, copy, errors)
    428 elif using_copy_on_write():
    429     copy = False
--> 431 return self.apply(
    432     "astype",
    433     dtype=dtype,
    434     copy=copy,
    435     errors=errors,
    436     using_cow=using_copy_on_write(),
    437 )

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/internals/managers.py:364, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    362         applied = b.apply(f, **kwargs)
    363     else:
--> 364         applied = getattr(b, f)(**kwargs)
    365     result_blocks = extend_blocks(applied, result_blocks)
    367 out = type(self).from_blocks(result_blocks, self.axes)

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/internals/blocks.py:758, in Block.astype(self, dtype, copy, errors, using_cow, squeeze)
    755         raise ValueError("Can not squeeze with more than one column.")
    756     values = values[0, :]  # type: ignore[call-overload]
--> 758 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    760 new_values = maybe_coerce_values(new_values)
    762 refs = None

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:237, in astype_array_safe(values, dtype, copy, errors)
    234     dtype = dtype.numpy_dtype
    236 try:
--> 237     new_values = astype_array(values, dtype, copy=copy)
    238 except (ValueError, TypeError):
    239     # e.g. _astype_nansafe can fail on object-dtype of strings
    240     #  trying to convert to float
    241     if errors == "ignore":

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:182, in astype_array(values, dtype, copy)
    179     values = values.astype(dtype, copy=copy)
    181 else:
--> 182     values = _astype_nansafe(values, dtype, copy=copy)
    184 # in pandas we don't store numpy str dtypes, so convert to object
    185 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/dtypes/astype.py:110, in _astype_nansafe(arr, dtype, copy, skipna)
    107 if lib.is_np_dtype(dtype, "M"):
    108     from pandas.core.arrays import DatetimeArray
--> 110     dta = DatetimeArray._from_sequence(arr, dtype=dtype)
    111     return dta._ndarray
    113 elif lib.is_np_dtype(dtype, "m"):

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py:327, in DatetimeArray._from_sequence(cls, scalars, dtype, copy)
    325 @classmethod
    326 def _from_sequence(cls, scalars, *, dtype=None, copy: bool = False):
--> 327     return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py:354, in DatetimeArray._from_sequence_not_strict(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
    351 else:
    352     tz = timezones.maybe_get_tz(tz)
--> 354 dtype = _validate_dt64_dtype(dtype)
    355 # if dtype has an embedded tz, capture it
    356 tz = _validate_tz_from_dtype(dtype, tz, explicit_tz_none)

File ~/mambaforge/envs/ai_2024_02_05a/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py:2550, in _validate_dt64_dtype(dtype)
   2544     raise ValueError(msg)
   2546 if (
   2547     isinstance(dtype, np.dtype)
   2548     and (dtype.kind != "M" or not is_supported_dtype(dtype))
   2549 ) or not isinstance(dtype, (np.dtype, DatetimeTZDtype)):
-> 2550     raise ValueError(
   2551         f"Unexpected value for 'dtype': '{dtype}'. "
   2552         "Must be 'datetime64[s]', 'datetime64[ms]', 'datetime64[us]', "
   2553         "'datetime64[ns]' or DatetimeTZDtype'."
   2554     )
   2556 if getattr(dtype, "tz", None):
   2557     # https://github.com/pandas-dev/pandas/issues/18595
   2558     # Ensure that we have a standard timezone for pytz objects.
   2559     # Without this, things like adding an array of timedeltas and
   2560     # a  tz-aware Timestamp (with a tz specific to its datetime) will
   2561     # be incorrect(ish?) for the array as a whole
   2562     dtype = cast(DatetimeTZDtype, dtype)

ValueError: Unexpected value for 'dtype': 'datetime64[D]'. Must be 'datetime64[s]', 'datetime64[ms]', 'datetime64[us]', 'datetime64[ns]' or DatetimeTZDtype'.

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

robdmc avatar Feb 07 '24 17:02 robdmc

This is likely a pandas version issue. Can you try downgrading to pandas 2.1?

We don't yet support pandas 2.2 because it contains some changes that require us to yet again handle date conversions and we haven't yet worked out all the details.

cpcloud avatar Feb 07 '24 17:02 cpcloud

You are correct. This problem goes away with pandas downgrade. I was able to circumvent the problem and remain in pandas 2.2 by explicitely casting those fields to datetime.

I'm good for now, so feel free to close/archive this bug if you'd like.

robdmc avatar Feb 07 '24 17:02 robdmc

I'll leave it open until we can upgrade our pandas support.

There's also a larger issue here with pandas versions, regarding how we should approach the many many ways there are to hold data in a pandas DataFrame or Series:

  • pyarrow
  • pandas nullable types
  • numpy types
  • any combination of the three

cpcloud avatar Feb 07 '24 17:02 cpcloud

@cpcloud I haven't played much with pandas outside the numpy types, so I'm not sure how everything is put together. One thing to keep in mind is that if users have spent time to add type-checking to dataframes (yours truly) then changing underlying types could potentially create a substantial maintenance burden on existing code-bases. This is not a reason to avoid making the best choices for the future of ibis, but it may be a reason to launch those changes from a really stable version of ibis that people can fall back to while upgrading their code bases.

Also -- probably more debt than you all want to hold -- but how bad would it be to have global ibis configuration for underlying pandas type system? Or maybe an execute() kwarg?

robdmc avatar Feb 19 '24 22:02 robdmc

but it may be a reason to launch those changes from a really stable version of ibis that people can fall back to while upgrading their code bases.

Yeah, that's a good point. We need to do a bit of planning here with input from the community to understand what might be the best path forward.

how bad would it be to have global ibis configuration for underlying pandas type system

Pretty bad 😅

There's at least three different pandas type systems if you consider them being used exclusively, but there's actually more: you can mix the type systems as you like. Maybe this is considered a feature, but it's a massive burden for downstream consumers of pandas to take on.

One problem is gaps in type system support. What should Ibis do in the case of nested types? I think folks are working on proper nested type support via pyarrow, without that support Ibis and other libraries are left to their own devices, probably using object dtypes. This makes integration with other libraries challenging.

So far the most promising idea is that all execution first goes through pyarrow, and then whatever comes out of pa.Table.to_pandas() on that result is what we would return from execute. Perhaps passing the pyarrow Table type mapping dict would be a thing we could do, to allow people to configure the return types if they absolutely must.

cpcloud avatar Feb 20 '24 13:02 cpcloud

Yikes! That sounds like a nightmare! Well, it sounds like you all are already pretty good of this, but putting thought into the the maintenance burden of your downstream users is always appreciated.

robdmc avatar Feb 20 '24 16:02 robdmc

Fixed by #8758.

cpcloud avatar Apr 15 '24 12:04 cpcloud