
bug: arrow type error when showing data with 'UUID' object

Open jitingxu1 opened this issue 1 year ago • 16 comments

What happened?

Issue 1: duckdb produces a different uuid for each row, but sqlite generates the same uuid for every row; other backends may have the same issue.

import ibis
ibis.options.interactive = True
from ibis.expr.api import row_number, uuid, now, pi

ibis.set_backend("sqlite")
t = ibis.examples.penguins.fetch()
t.mutate(uuid=ibis.uuid()).to_pandas()
(screenshot: every row in the uuid column shows the same value)

Issue 2: get ArrowTypeError when show data:

import ibis
ibis.options.interactive = True
from ibis.expr.api import row_number, uuid, now, pi

ibis.set_backend("sqlite")
t = ibis.examples.penguins.fetch()
t1 = t.mutate(my_uuid=uuid())
t1[t1.my_uuid].head()

Got the following error:

Out[7]: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/voltrondata/repos/ibis/ibis/expr/types/relations.py:516 in __interactive_rich_console__   │
│                                                                                                  │
│    513 │   │   │   width = options.max_width                                                     │
│    514 │   │                                                                                     │
│    515 │   │   try:                                                                              │
│ ❱  516 │   │   │   table = to_rich_table(self, width)                                            │
│    517 │   │   except Exception as e:                                                            │
│    518 │   │   │   # In IPython exceptions inside of _repr_mimebundle_ are swallowed to          │
│    519 │   │   │   # allow calling several display functions and choosing to display             │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/expr/types/pretty.py:265 in to_rich_table                     │
│                                                                                                  │
│   262 │                                                                                          │
│   263 │   # Compute the data and return a pandas dataframe                                       │
│   264 │   nrows = ibis.options.repr.interactive.max_rows                                         │
│ ❱ 265 │   result = table.limit(nrows + 1).to_pyarrow()                                           │
│   266 │                                                                                          │
│   267 │   # Now format the columns in order, stopping if the console width would                 │
│   268 │   # be exceeded.                                                                         │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/expr/types/core.py:425 in to_pyarrow                          │
│                                                                                                  │
│   422 │   │   Table                                                                              │
│   423 │   │   │   A pyarrow table holding the results of the executed expression.                │
│   424 │   │   """                                                                                │
│ ❱ 425 │   │   return self._find_backend(use_default=True).to_pyarrow(                            │
│   426 │   │   │   self, params=params, limit=limit, **kwargs                                     │
│   427 │   │   )                                                                                  │
│   428                                                                                            │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/backends/__init__.py:218 in to_pyarrow                        │
│                                                                                                  │
│    215 │   │   table_expr = expr.as_table()                                                      │
│    216 │   │   schema = table_expr.schema()                                                      │
│    217 │   │   arrow_schema = schema.to_pyarrow()                                                │
│ ❱  218 │   │   with self.to_pyarrow_batches(                                                     │
│    219 │   │   │   table_expr, params=params, limit=limit, **kwargs                              │
│    220 │   │   ) as reader:                                                                      │
│    221 │   │   │   table = pa.Table.from_batches(reader, schema=arrow_schema)                    │
│                                                                                                  │
│ /Users/voltrondata/repos/ibis/ibis/backends/sqlite/__init__.py:264 in to_pyarrow_batches         │
│                                                                                                  │
│   261 │   │   │   self.compile(expr, limit=limit, params=params)                                 │
│   262 │   │   ) as cursor:                                                                       │
│   263 │   │   │   df = self._fetch_from_cursor(cursor, schema)                                   │
│ ❱ 264 │   │   table = pa.Table.from_pandas(                                                      │
│   265 │   │   │   df, schema=schema.to_pyarrow(), preserve_index=False                           │
│   266 │   │   )                                                                                  │
│   267 │   │   return table.to_reader(max_chunksize=chunk_size)                                   │
│                                                                                                  │
│ in pyarrow.lib.Table.from_pandas:3874                                                            │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:611 in dataframe_to_arrays                                                                   │
│                                                                                                  │
│    608 │   │   │   │   issubclass(arr.dtype.type, np.integer))                                   │
│    609 │                                                                                         │
│    610 │   if nthreads == 1:                                                                     │
│ ❱  611 │   │   arrays = [convert_column(c, f)                                                    │
│    612 │   │   │   │     for c, f in zip(columns_to_convert, convert_fields)]                    │
│    613 │   else:                                                                                 │
│    614 │   │   arrays = []                                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:611 in <listcomp>                                                                            │
│                                                                                                  │
│    608 │   │   │   │   issubclass(arr.dtype.type, np.integer))                                   │
│    609 │                                                                                         │
│    610 │   if nthreads == 1:                                                                     │
│ ❱  611 │   │   arrays = [convert_column(c, f)                                                    │
│    612 │   │   │   │     for c, f in zip(columns_to_convert, convert_fields)]                    │
│    613 │   else:                                                                                 │
│    614 │   │   arrays = []                                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:598 in convert_column                                                                        │
│                                                                                                  │
│    595 │   │   │   │   pa.ArrowTypeError) as e:                                                  │
│    596 │   │   │   e.args += ("Conversion failed for column {!s} with type {!s}"                 │
│    597 │   │   │   │   │      .format(col.name, col.dtype),)                                     │
│ ❱  598 │   │   │   raise e                                                                       │
│    599 │   │   if not field_nullable and result.null_count > 0:                                  │
│    600 │   │   │   raise ValueError("Field {} was non-nullable but pandas column "               │
│    601 │   │   │   │   │   │   │    "had {} null values".format(str(field),                      │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/pyarrow/pandas_compat │
│ .py:592 in convert_column                                                                        │
│                                                                                                  │
│    589 │   │   │   type_ = field.type                                                            │
│    590 │   │                                                                                     │
│    591 │   │   try:                                                                              │
│ ❱  592 │   │   │   result = pa.array(col, type=type_, from_pandas=True, safe=safe)               │
│    593 │   │   except (pa.ArrowInvalid,                                                          │
│    594 │   │   │   │   pa.ArrowNotImplementedError,                                              │
│    595 │   │   │   │   pa.ArrowTypeError) as e:                                                  │
│                                                                                                  │
│ in pyarrow.lib.array:340                                                                         │
│                                                                                                  │
│ in pyarrow.lib._ndarray_to_array:86                                                              │
│                                                                                                  │
│ in pyarrow.lib.check_status:91                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ArrowTypeError: ("Expected bytes, got a 'UUID' object", 'Conversion failed for column my_uuid with type
object')

It works fine with to_pandas(), though:

In [8]: t1[t1.my_uuid].to_pandas()
Out[8]:
                                  my_uuid
0    3f661a76-2d0e-4622-862e-1c4adcfd4813
1    3f661a76-2d0e-4622-862e-1c4adcfd4813
2    3f661a76-2d0e-4622-862e-1c4adcfd4813
3    3f661a76-2d0e-4622-862e-1c4adcfd4813
4    3f661a76-2d0e-4622-862e-1c4adcfd4813
..                                    ...
339  3f661a76-2d0e-4622-862e-1c4adcfd4813
340  3f661a76-2d0e-4622-862e-1c4adcfd4813
341  3f661a76-2d0e-4622-862e-1c4adcfd4813
342  3f661a76-2d0e-4622-862e-1c4adcfd4813
343  3f661a76-2d0e-4622-862e-1c4adcfd4813

What version of ibis are you using?

8.0.0

What backend(s) are you using, if any?

duckdb, sqlite

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

jitingxu1 avatar Mar 04 '24 00:03 jitingxu1

Thanks for opening this. Issue 1 should be fixed in #8535.

Issue 2 is due to the to_pyarrow conversion path in sqlite (and a few other backends) going dbapi row -> pandas -> pyarrow. When returning pandas dataframes from to_pandas we currently map a UUID column to an object dtype series of uuid.UUID objects (and these objects fail when converting to pyarrow). In contrast, for to_pyarrow we return a string column with the same data.

The easiest (and I think most consistent) fix would be to stop returning uuid columns in to_pandas as uuid.UUID values and instead treat them as strings. This matches what we do for both polars and pyarrow outputs. It's also more efficient for the user, since they don't end up with an object dtype series in the output.

cc @cpcloud for a :+1: / :-1: before I implement this fix.

jcrist avatar Mar 04 '24 04:03 jcrist

Eventually we should simplify the pandas output path to .to_pyarrow().to_pandas() to offload all the conversion duties to arrow. So it is a +1 from me.

kszucs avatar Mar 04 '24 07:03 kszucs

Seems fine. I don't like that we have to do this but the alternative of implementing a custom pyarrow type seems less desirable than converting to strings.

cpcloud avatar Mar 04 '24 18:03 cpcloud

The repeated UUID issue has been addressed:

In [4]: import ibis
   ...: ibis.options.interactive = True
   ...: from ibis.expr.api import row_number, uuid, now, pi
   ...:
   ...: ibis.set_backend("sqlite")
   ...: t = ibis.examples.penguins.fetch()
   ...: t.mutate(uuid=ibis.uuid()).to_pandas()
Out[4]:
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year                                  uuid
0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007  f3102c7e-167c-4854-af20-d3729580e2cc
1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007  86afae37-f0e5-48d7-ba13-aa701374d4cd
2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007  665cbe36-d7b7-4a7e-bd5f-ebf0c043c72f
3       Adelie  Torgersen             NaN            NaN                NaN          NaN    None  2007  a740b304-0a13-4f89-bdb4-fa9475f2daa4
4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007  a8263a30-9cbb-4175-94e9-1429a6fdb0fa
..         ...        ...             ...            ...                ...          ...     ...   ...                                   ...
339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009  112e7fcc-bf14-4177-96ee-526d9343c368
340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009  c5964727-4dc0-42dd-8039-1527bd37b673
341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009  a3d1a137-5847-4309-90c4-59d0f8fe35f9
342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009  33da8b0b-d368-442c-ba05-44daa037b1e0
343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009  de5b1031-de4a-4c13-85a6-920e4741f922

cpcloud avatar Aug 10 '24 15:08 cpcloud

I think there is a simple and self-contained solution for UUID types and any other type that does not have a 1:1 mapping between Ibis and PyArrow, such as MAC addresses.

Let me know what you think. I am willing to add tests and make a PR if this approach makes sense.

This commit only checks for bijectivity between Ibis and PyArrow types, and if any column of the schema does not comply, then it is cast server-side.

There are some pros:

  • Ibis does not need to handle special cases backend by backend. PyArrowType mappings define this behavior for all backends that use PyArrow conversion.
  • It could easily extend to future types.
  • "Type-loss" conversions could be cast into other types that are not dt.string, although currently the only relevant ones are mapped to string.
  • Sequential expr.to_pyarrow().to_X() will be consistent with expr.to_X().
  • It does not require pandas types, making Ibis more "pandas-agnostic", which seems important based on the #8901 conversations.
  • This affects only operations that previously would fail. It does not change any defined behavior.

Cons:

  • The cast is implicit and happens without a warning, though no data is lost.
  • It does not keep any metadata if the user converts them back to an Ibis table. I know there are some conversations about extending types with metadata, so I guess there is no "right way" to do this yet.
  • The cast is done server-side, which could be unexpected sometimes, but IMO this is a pro since Ibis is agnostic to the backend: if there is no explicit 1:1 conversion to a PyArrow type, then the backend's casting is the best (maybe only?) defined behavior we can rely on.

double-thinker avatar Sep 02 '24 19:09 double-thinker

@cpcloud Do you think the approach I developed is worthwhile? I can submit a PR in the following days if there's interest.

double-thinker avatar Sep 04 '24 13:09 double-thinker

At least for UUID, since Arrow has a canonical extension type for it, it seems like that would be a great way to maintain the type information.

kylebarron avatar Sep 16 '24 13:09 kylebarron


Any progress on this?

james-roden avatar Mar 24 '25 14:03 james-roden

I think what you see here is as much progress as has happened.

It looks like pyarrow implemented the uuid extension type in https://github.com/apache/arrow/pull/37298 in August 2023, and it first appeared in pyarrow 18.0.0. So I would advocate for making all of ibis's .to_pyarrow() functions return this. This would be breaking. Of course, we currently have a (very low) lower bound on our pyarrow dependency of pyarrow>=10.0.1. So this raises the question: do we return strings for pyarrow>=10.0.1,<18.0.0, and uuid types for pyarrow>=18.0.0? I don't love it, but I don't want people on old pyarrow to be holding everyone else back from getting a good experience.

Also, someone has started a PR in pyarrow so that pa.array([uuid.UUID()]) works. So if that lands, then there is one less thing to worry about.

In the meantime before all this, I think adding a temp workaround for postgres and other backends that use the pandas -> arrow path (where we get this type error) so that users get strings (the same behavior as for duckdb etc) would be great.

NickCrews avatar Apr 04 '25 00:04 NickCrews

Is there a workaround that users can apply to get this working for now?

grv87 avatar Apr 21 '25 21:04 grv87

The only workaround that I have been doing is to use a duckdb backend, and then use ATTACH ... (TYPE POSTGRES), and then proxy all my queries through this duckdb backend. Not great at all lol, but for the limited stuff I am doing it works.

NickCrews avatar Apr 22 '25 15:04 NickCrews

Or, cast all the uuid columns to strings on the backend as a final step before materializing.

NickCrews avatar Apr 23 '25 13:04 NickCrews

@NickCrews In the meantime, I think this could be a [good workaround](https://github.com/ibis-project/ibis/compare/main...double-thinker:ibis:fix/pyarrowtypes), even a sensible default for future type mismatches. If the lack of tests is what is stopping this from being merged, I can allocate time to do that if it will be merged.

I did not make a PR before because my comment had no answer. cc @cpcloud

Related PR: https://github.com/ibis-project/ibis/pull/11079

double-thinker avatar Jun 09 '25 07:06 double-thinker

Related: DuckDB, when materializing a uuid column with .arrow(), returns an array of pyarrow strings. I THINK they consider this a bug, and hopefully will change this behavior to return an array of pyarrow uuids.

So, I think actually I would like to change my desired behavior to the following, because it will be the most performant.

import uuid

import pandas as pd
import pyarrow as pa

byte_array = [uuid.uuid4().bytes, uuid.uuid4().bytes]

# EDIT: I think we can actually write our own ~10-line pyarrow ExtensionType subclass
# that implements the uuid datatype, so that we get the same result no matter which version
# of arrow the user is running.
pa_arr_pyarrow_18_and_above = pa.array(byte_array, type=pa.uuid())
pa_arr_pyarrow_17_and_below = pa.array(byte_array, type=pa.binary())
pd_series_pyarrow_18_and_above = pd.Series(byte_array, dtype=pd.ArrowDtype(pa.uuid()))
pd_series_pyarrow_17_and_below = pd.Series(byte_array, dtype=pd.ArrowDtype(pa.binary()))

Raw bytes are not as easy to read as uuid strings like 'e1ee3810-76e2-4291-b5d8-08bcba45be3e', so that is a bummer for users, but I think that is an ergonomics hit we should take for the performance. If they want us to take care of the conversion, they can do my_ibis_uuid_expr.cast(str).to_pyarrow(). For example, when repr()ing an expression with rich, I think we should do this so users get the nice UUID string.
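For users handed raw 16-byte values, the stdlib already converts between the bytes form and the familiar hyphenated string in both directions:

```python
import uuid

u = uuid.UUID("e1ee3810-76e2-4291-b5d8-08bcba45be3e")
raw = u.bytes                    # the 16 raw bytes a bytes-based output would hold
back = uuid.UUID(bytes=raw)      # reconstruct the UUID from those bytes
assert str(back) == "e1ee3810-76e2-4291-b5d8-08bcba45be3e"
```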

NickCrews avatar Jun 09 '25 18:06 NickCrews

The original issue of

same uuid generated by sqlite

was addressed previously when we changed the SQLite uuid UDF to be non-deterministic, which prevents SQLite from running a function once and reusing its result.

The secondary issue of UUIDs not working with PyArrow is not addressed yet.

I think the most reliable approach is to bring back UUIDs as bytes, always, until the PyArrow and DuckDB UUID stories are consistent and straightened out.

cpcloud avatar Jun 10 '25 13:06 cpcloud

I did get a custom uuid type working:

import uuid

import duckdb
import pandas as pd
import pyarrow as pa

print(pa.__version__)
print(duckdb.__version__)

byte_list = [uuid.uuid4().bytes, uuid.uuid4().bytes]


class UUIDType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.binary(16), "arrow.uuid")

    def __arrow_ext_serialize__(self) -> bytes:
        # No parameters are necessary
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return UUIDType()


our_type = UUIDType()
builtin_type = pa.uuid()

for typ in [our_type, builtin_type]:
    pa_arr = pa.array(byte_list, type=typ)
    pd_series = pd.Series(byte_list, dtype=pd.ArrowDtype(typ))
    print(repr(pa_arr))

    # Both our and the builtin version are interpreted by duckdb as UUIDs!
    pa_table = pa.Table.from_arrays([pa_arr], names=["uuids"])
    print(duckdb.sql("SELECT uuids, TYPEOF(uuids) FROM pa_table"))

@cpcloud It appears that duckdb wants to continue to return UUIDs as strings by default. If we explicitly SET arrow_lossless_conversion = true, then duckdb will return the arrow.uuid type. See https://github.com/duckdb/duckdb/pull/17857 where I tried to implement the change in duckdb and was (kindly) rejected, and https://github.com/duckdb/duckdb/pull/18046/files where the docs were updated to show how it actually works.

Does this affect our policy of what we want UUIDs to be represented as? I don't think so, I think we can hopefully use this configuration, or if it doesn't work (eg it affects how other results are materialized) then we can add workarounds.

So, to be clear, the ideal behavior that I could/should submit a PR for is

  • return uuid.UUID objects for UUIDScalar.execute() and UUIDScalar.to_pandas() (pandas has no native way of representing uuids or bytes)
  • return pandas.Series of dtype object, (holding uuid.UUIDs) for UUIDColumn.execute(), UUIDColumn.to_pandas()
  • return the arrow.uuid extension type (either builtin or our backport as above) for UUIDScalar.to_pyarrow() and UUIDColumn.to_pyarrow()

Does that sound like a good plan @cpcloud ?

NickCrews avatar Aug 25 '25 20:08 NickCrews

Well, for duckdb I tried implementing SET arrow_lossless_conversion = true as a pre-execute hook before every execution, and this DOES have side effects: in particular, when we execute hugeints we get back the DuckDB hugeint extension type instead of decimal(38,0).

This could work as a way to convert in duckdb as a final step before materializing:

-- Note: casting the hyphen-stripped VARCHAR straight to BLOB would keep the
-- 32 hex characters as bytes; unhex decodes them into the 16 raw bytes.
CREATE OR REPLACE MACRO uuid_to_bytes(uuid_val) AS (
    unhex(replace(cast(uuid_val AS VARCHAR), '-', ''))
);

SELECT uuid_to_bytes('123e4567-e89b-12d3-a456-426614174000'::UUID) AS blob;

Or, in the next release of duckdb (PR was merged 5 days ago), we can directly CAST(uuid_val AS BLOB)

NickCrews avatar Aug 26 '25 16:08 NickCrews