[FEATURE-REQUEST] Support interchanging vaex dataframes with Arrow-backend columns
Initialising an interchange protocol buffer (_VaexBuffer) only works for vaex columns with NumPy backends
https://github.com/vaexio/vaex/blob/35c250d585f889272b8ef1096de6fa5462816f52/packages/vaex-core/vaex/dataframe_protocol.py#L241-L244
_VaexBuffer.__init__() is private API, but affects interchange with different libraries as this is called when using the public API of Column.get_buffers()
packages/vaex-core/vaex/dataframe_protocol.py:565: in get_buffers
    buffers["data"] = self._get_data_buffer()
packages/vaex-core/vaex/dataframe_protocol.py:603: in _get_data_buffer
    buffer = _VaexBuffer(self._col.values)
So obviously it'd be nice (if not practically essential?) if vaex supported interchanging Arrow-backend columns too. I just thought to raise this issue as a tracker, as I didn't quite see relevant conversation in https://github.com/vaexio/vaex/pull/1509. cc @maartenbreddels
Even if the buffer is stored as a NumPy array, the underlying data can still be an Arrow array.
I think it should be possible to do arrow->protocol->arrow without a memory copy. At least that's how we designed the spec, AFAIR. It could be that the implementation is still missing some parts.
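For anyone following along, here is a rough sketch of what "without a memory copy" means at the buffer level. It uses only the standard library, so it is not vaex or Arrow code: `SketchBuffer` and `view_from_buffer` are hypothetical names made up for illustration, and only the `ptr` and `bufsize` attribute names are borrowed from the interchange protocol's buffer object. The idea is that the consumer rebuilds its view from the producer's pointer, so both sides see the same bytes.

```python
import array
import ctypes


class SketchBuffer:
    """Hypothetical stand-in for an interchange buffer (not vaex's _VaexBuffer)."""

    def __init__(self, data: array.array):
        # Keep a reference so the memory stays alive as long as the buffer does.
        self._data = data

    @property
    def ptr(self) -> int:
        # Address of the first element; no bytes are copied.
        address, _length = self._data.buffer_info()
        return address

    @property
    def bufsize(self) -> int:
        # Total size of the buffer in bytes.
        return len(self._data) * self._data.itemsize


def view_from_buffer(buf: SketchBuffer):
    """Consumer side: rebuild an int64 view from the pointer, without copying."""
    n = buf.bufsize // ctypes.sizeof(ctypes.c_int64)
    return (ctypes.c_int64 * n).from_address(buf.ptr)


producer_data = array.array("q", range(5))  # "q" = signed 64-bit integers
buf = SketchBuffer(producer_data)
consumer_view = view_from_buffer(buf)

# A write on the producer side is visible to the consumer, so memory is shared.
producer_data[0] = 42
shared = consumer_view[0] == 42
```

A real producer would also have to describe the dtype and handle validity and offsets buffers; this only shows the pointer hand-off that makes a copy-free round trip possible.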
Ah so you fixed the issue I was alluding to in #2122
- buffer = _VaexBuffer(self._col.values)
+ buffer = _VaexBuffer(indices.to_numpy())
Before, a test like the following would fail
import numpy as np

def test_smoke_get_buffers(df_factory):
    x = np.arange(5)
    df = df_factory(x=x)
    df = df.categorize("x")
    interchange_df = df.__dataframe__()
    interchange_col = interchange_df.get_column_by_name("x")
    # Smoke test: this call should simply not raise
    interchange_col.get_buffers()
for the pyarrow(+chunked) dataframe. So I think you're all good? I'll get to forcibly generate Arrow-backend examples for dataframe-interchange-tests.
Wrote a regression test https://github.com/vaexio/vaex/pull/2135