feat: support UUIDs to pyarrow on more backends

Open NickCrews opened this issue 10 months ago • 1 comments

partially fixes #8902.

Implements UUID execution to pyarrow on some backends, and adds notimpl tests for the rest.

NickCrews avatar Apr 05 '24 18:04 NickCrews

OK, I think this brings up a larger philosophical question: Do we want to totally separate the pandas and pyarrow codepaths, or can they rely on each other?

Currently, to get pyarrow results from a backend:

  1. for some backends we go straight from the DB cursor object to pyarrow arrays, never needing pandas.
  2. In the backends that I touch in this PR, we go through the path of db_cursor -> pandas -> pyarrow.

I think the coupling between pandas and pyarrow for this conversion isn't inherently bad (it means we don't need to implement the db -> pyarrow path!). But I agree that it should be isolated, so it is very clear where we are mixing these two ecosystems. That way, for the backends that don't need it, you can have just pyarrow installed, without needing pandas.

So I see two options:

  1. keep this db_cursor -> pandas -> pyarrow path, but just sequester it into some 3rd module that is external to both ibis/formats/pandas.py and ibis/formats/pyarrow.py
  2. in these backends that don't have it yet, implement the db_cursor -> pyarrow conversion directly.

I think I would lean towards 2. I want to remove reliance on pandas as much as possible. Possibly this implementation won't be that hard for these other backends.

NickCrews avatar Jun 30 '24 18:06 NickCrews

I think we'd like to eventually be able to offer Ibis without requiring pyarrow or pandas, or at least without requiring pandas. Many systems are starting to have arrow-native endpoints that don't involve pandas, so db -> pyarrow is actually better for those cases.

There's also the potential of using something that doesn't depend on either of those for the core (like printing tables), so I think we'd like to keep things as isolated from one another as possible.

Even more important is the fact that sending anything through pandas is likely to result in some kind of type or value alteration that doesn't happen with pyarrow. Especially with NULLs, pandas is likely to do something completely different and incompatible with what pyarrow would do.

cpcloud avatar Jul 01 '24 18:07 cpcloud

Ok, when I get back to this I'll try the db -> arrow method!

NickCrews avatar Jul 01 '24 18:07 NickCrews

Is this PR still viable?

cpcloud avatar Jul 23 '24 18:07 cpcloud

Viable, I just stopped needing it personally, so its urgency dropped a lot compared to the 5 million other PRs I have open haha. Feel free to close if you want, and re-open once someone actually finds time to work on it.

NickCrews avatar Jul 23 '24 19:07 NickCrews