
Implement the Array API

Open alippai opened this issue 3 years ago • 14 comments

Implementing the Array API (https://data-apis.org/array-api/latest/purpose_and_scope.html) would improve the long-term interoperability of data science libraries.

The conformance can be tested using: https://github.com/data-apis/array-api-tests

I know Polars is a much higher-level library, but I believe conforming to this protocol while using Polars components could make sense.

alippai avatar Jan 01 '22 22:01 alippai

This indeed sounds very interesting. I saw that the API is in its own namespace so it would not interfere with our Series API.

It would be pretty neat if consumers like scikit-learn could work with the array API. This prevents a copy to numpy.
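To make the idea concrete, here is a rough sketch of how a consumer could stay library-agnostic. The only thing taken from the standard is the `__array_namespace__` method; `ToyArray`, `toy_namespace`, and `describe` are hypothetical names made up for illustration, and the toy functions return plain floats where the standard would return 0-d arrays.

```python
import types


class ToyArray:
    """Hypothetical 1D container standing in for a Series-like object."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __array_namespace__(self, api_version=None):
        # The standard uses this method as the entry point to the namespace.
        return toy_namespace


def _sum(x):
    return sum(x.values)


def _mean(x):
    return sum(x.values) / len(x.values)


# Hypothetical namespace; a real one would cover the whole specification.
toy_namespace = types.SimpleNamespace(sum=_sum, mean=_mean)


def describe(x):
    """Consumer code: depends only on the namespace, not on a concrete library."""
    xp = x.__array_namespace__()
    return {"sum": xp.sum(x), "mean": xp.mean(x)}
```

Because `describe` only talks to whatever namespace the input hands back, it never needs to convert the data to a numpy array first.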

ritchie46 avatar Jan 02 '22 19:01 ritchie46

I agree. Awesome initiative. The API seems directed towards tensors, but offering the 1D experience is still quite powerful.

jorgecarleitao avatar Jan 02 '22 19:01 jorgecarleitao

A little bit off-topic, but I was always wondering: my understanding is that a 1D series is pretty straightforward with Arrow and Polars, and that 2D arrays, while clunky, still work well (as a vector of Series).

Does Arrow support or prevent an efficient tensor representation? Is it materially different, so that we don't support things that numpy / ndarray / tensorflow handle well? Could we run e.g. BLAS / LAPACK over data in Arrow (directly, efficiently)?

alippai avatar Jan 02 '22 20:01 alippai

Arrow's IPC specification includes both tensors and sparse tensors.

The reason I have not added them to arrow2 is that there are no integration tests for them atm, so it is pretty much a wild west. This is an area I wish we could improve in the future, so that we can e.g. have tensors in polars.

jorgecarleitao avatar Jan 02 '22 20:01 jorgecarleitao

There is the FixedSizeList type. Otherwise you could build a matrix or tensor type around a 1D array. I think all serious tensors are backed by contiguous memory and get their dimensions from their indexing magic.
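As a minimal sketch of that "indexing magic", assuming a row-major layout: a 2D view is just a flat buffer plus a shape and strides. `FlatMatrix` is a hypothetical name for illustration, not a Polars or Arrow type.

```python
class FlatMatrix:
    """Hypothetical row-major 2D view over a flat Python list."""

    def __init__(self, data, shape):
        rows, cols = shape
        if len(data) != rows * cols:
            raise ValueError("data length does not match shape")
        self.data = data
        self.shape = shape
        # Strides in elements: step `cols` to move one row, 1 to move one column.
        self.strides = (cols, 1)

    def __getitem__(self, index):
        i, j = index
        rows, cols = self.shape
        if not (0 <= i < rows and 0 <= j < cols):
            raise IndexError("index out of range")
        return self.data[i * self.strides[0] + j * self.strides[1]]

    def transpose(self):
        """Return a transposed view that shares the same buffer (no copy)."""
        view = FlatMatrix.__new__(FlatMatrix)
        view.data = self.data
        view.shape = (self.shape[1], self.shape[0])
        view.strides = (self.strides[1], self.strides[0])
        return view


m = FlatMatrix([1, 2, 3, 4, 5, 6], shape=(2, 3))
t = m.transpose()
```

Here `m[1, 0]` and `t[0, 1]` read the same element of the same buffer; only the shape and strides differ between the two views.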

ritchie46 avatar Jan 02 '22 20:01 ritchie46

So a numpy array is a Series of uniform 1d tensors or a FixedSizeList of integer/float vectors? Interesting, thanks a lot for the details.

alippai avatar Jan 02 '22 20:01 alippai

Yes, matrices are typically backed by 1D memory because a Vec<Vec<_>> would have a cache miss at every row/column traversal (and more in higher dimensions).

I assume arrow lists are backed by linear memory for the same reason.
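For illustration, a simplified sketch of the Arrow list layouts in plain Python: a variable-size list is one flat values buffer plus an offsets buffer, and a fixed-size list needs no offsets at all. The helper names are hypothetical, and validity bitmaps (nulls) are ignored here.

```python
def list_slot(offsets, values, i):
    """Variable-size list layout: slot i spans values[offsets[i]:offsets[i + 1]]."""
    return values[offsets[i]:offsets[i + 1]]


def fixed_size_list_slot(values, size, i):
    """Fixed-size list layout: no offsets needed, slot i starts at i * size."""
    return values[i * size:(i + 1) * size]


# [[1, 2], [], [3, 4, 5]] stored as one flat values buffer plus offsets.
values = [1, 2, 3, 4, 5]
offsets = [0, 2, 2, 5]

# [[1, 2, 3], [4, 5, 6]] stored as a flat buffer with a fixed slot width of 3.
matrix_values = [1, 2, 3, 4, 5, 6]
```

In both cases the child values sit next to each other in one buffer, so walking the rows in order touches memory linearly.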

ritchie46 avatar Jan 02 '22 21:01 ritchie46

There is also the DataFrame API which would seem a better fit for polars:

  • https://github.com/data-apis/dataframe-api

It would be pretty neat if consumers like scikit-learn could work with the array API.

+:100:

dhirschfeld avatar Mar 23 '22 05:03 dhirschfeld

There is also the DataFrame API which would seem a better fit for polars:

  • https://github.com/data-apis/dataframe-api

It would be pretty neat if consumers like scikit-learn could work with the array API.

+:100:

I'd think we'd want to conform with both, no?

cnpryer avatar Jun 05 '22 02:06 cnpryer

I'd think we'd want to conform with both, no?

I think they're two separate things. You're either trying to provide a 2D DataFrame api or an nD Tensor api. It may be that the DataFrame api is implemented as a collection of 1D arrays conforming to the api, but I'd imagine that the DataFrame standard would specify that.

As an outside, occasional user, it seems to me that polars is trying to implement a 2D DataFrame api so would best conform to the DataFrame standard. I'm not an expert in polars though!

dhirschfeld avatar Jun 05 '22 10:06 dhirschfeld

I imagine projects like NumPy are the target of the Array API, so I'm not sure Series fits here. If it doesn't, the next question is where the line gets drawn with the upstream structures used.

But I'd assume both DataFrames and Series will consume arrays conforming to the API.

Found this comment.

cnpryer avatar Jun 05 '22 14:06 cnpryer

It may be of interest that it looks like Pandas now implements the DataFrame interchange protocol (from the same Data APIs consortium as the Array API specification) as of yesterday's 1.5.0 release: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.5.0.html#dataframe-interchange-protocol-implementation

kylebarron avatar Sep 21 '22 15:09 kylebarron

Yep, I want that too. Any help on this would be very much appreciated.

ritchie46 avatar Sep 21 '22 15:09 ritchie46

It would be pretty neat if consumers like scikit-learn could work with the array API. This prevents a copy to numpy.

This is WIP! See https://github.com/scikit-learn/scikit-learn/issues/22352 for a general overview and https://github.com/scikit-learn/scikit-learn/pull/22554 for its first experimental support.

As @dhirschfeld pointed out, the DataFrame API might make more sense for polars.

jjerphan avatar Oct 21 '22 19:10 jjerphan