polars
Implement the Array API
Implementing the Array API (https://data-apis.org/array-api/latest/purpose_and_scope.html) would improve the long-term interoperability of data science libraries.
Conformance can be tested using: https://github.com/data-apis/array-api-tests
I know Polars is a much higher-level library, but I believe conforming to this protocol while using Polars components could make sense.
This indeed sounds very interesting. I saw that the API is in its own namespace so it would not interfere with our Series API.
It would be pretty neat if consumers like scikit-learn could work with the array API. This prevents a copy to numpy.
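To make the "consumer" idea concrete, here is a minimal sketch of a function written against the array API rather than against NumPy directly. It looks up the array's own namespace through `__array_namespace__` (the entry point defined by the standard) and only calls functions from that namespace. The helper names `get_namespace` and `center` are hypothetical, and the fallback to NumPy for arrays that lack `__array_namespace__` is an assumption made for this sketch, not something any library prescribes.

```python
import numpy as np


def get_namespace(x):
    """Return the array API namespace for x, falling back to NumPy."""
    # `__array_namespace__` is the entry point defined by the standard.
    # Arrays from older NumPy releases do not expose it, hence the
    # fallback (an assumption made for this sketch).
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np


def center(x):
    """Subtract the mean using only functions from the array's own namespace."""
    xp = get_namespace(x)
    return x - xp.mean(x)


data = np.asarray([1.0, 2.0, 3.0, 6.0])
centered = center(data)
```

Because `center` never converts its input, any array type that exposes a conforming namespace could in principle be passed through without a copy to NumPy.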
I agree. Awesome initiative. The API seems directed towards tensors, but offering the 1D experience is still quite powerful.
A little bit off-topic, but I was always wondering: my understanding is that a 1D Series is pretty straightforward with Arrow and Polars, and that 2D arrays, while clunky, still work well (as a vector of Series).
Does Arrow support or prevent efficient tensor representation? Is it materially different, and are there things we don't support that numpy / ndarray / tensorflow handle well? Could we run e.g. BLAS / LAPACK over data in Arrow (directly, efficiently)?
Arrow's IPC specification includes both tensors and sparse tensors.
The reason I have not added them to arrow2 is that there are no integration tests for them at the moment, so it is pretty much the wild west. This is an area I wish we could improve in the future, so that we can e.g. have tensors in polars.
There is the FixedSizeList type. Otherwise, you could build a matrix or tensor type around a 1D array. I think all serious tensors are backed by contiguous memory and get their dimensions from their indexing magic.
So a numpy array is a Series of uniform 1d tensors or a FixedSizeList of integer/float vectors? Interesting, thanks a lot for the details.
Yes, matrices are typically backed by 1D memory because a Vec<Vec<_>> would have a cache miss at every row/column traversal (and more in higher dimensions).
I assume arrow lists are backed by linear memory for the same reason.
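The "indexing magic" over contiguous memory can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how polars, Arrow, or NumPy implement it: `FlatMatrix` is a hypothetical class that stores one flat buffer and turns a `(row, column)` pair into a single offset using row-major strides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FlatMatrix:
    """A 2D view over one contiguous 1D buffer, stored row-major."""

    data: List[float]
    shape: Tuple[int, int]

    def __post_init__(self) -> None:
        rows, cols = self.shape
        if rows * cols != len(self.data):
            raise ValueError("shape does not match buffer length")
        # Strides in elements: skip `cols` items to move down one row,
        # 1 item to move right one column.
        self.strides = (cols, 1)

    def __getitem__(self, index: Tuple[int, int]) -> float:
        i, j = index
        rows, cols = self.shape
        if not (0 <= i < rows and 0 <= j < cols):
            raise IndexError("index out of bounds")
        return self.data[i * self.strides[0] + j * self.strides[1]]

    def reshape(self, rows: int, cols: int) -> "FlatMatrix":
        # Only the shape (and therefore the strides) changes;
        # the underlying buffer is shared, not copied.
        return FlatMatrix(self.data, (rows, cols))


m = FlatMatrix(list(range(6)), (2, 3))
r = m.reshape(3, 2)
```

Here `m[1, 2]` and `r[2, 1]` both resolve to offset 5 in the same buffer, which is why a reshape is cheap when the memory is contiguous.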
There is also the DataFrame API which would seem a better fit for polars:
- https://github.com/data-apis/dataframe-api
It would be pretty neat if consumers like scikit-learn could work with the array API.
+:100:
I'd think we'd want to conform with both, no?
I think they're two separate things. You're either trying to provide a 2D DataFrame api or an nD Tensor api. It may be that the DataFrame api is implemented as a collection of 1D arrays conforming to the api, but I'd imagine that the DataFrame standard would specify that.
As an outsider and occasional user, it seems to me that polars is trying to implement a 2D DataFrame api, so it would best conform to the DataFrame standard. I'm not an expert in polars though!
I imagine the Array API is aimed at projects like NumPy, so I'm not sure Series fits here. If it doesn't, the next question is where the line gets drawn for the upstream structures used.
But I'd assume both DataFrames and Series will consume arrays conforming to the API.
Found this comment.
It may be of interest that pandas now implements the DataFrame interchange protocol (the DataFrame counterpart to the Array API specification) as of yesterday's 1.5.0 release: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.5.0.html#dataframe-interchange-protocol-implementation
Yep, I want that too. Any help on this would be very much appreciated.
It would be pretty neat if consumers like scikit-learn could work with the array API. This prevents a copy to numpy.
This is WIP! See https://github.com/scikit-learn/scikit-learn/issues/22352 for a general overview and https://github.com/scikit-learn/scikit-learn/pull/22554 for its first experimental support.
As @dhirschfeld pointed out, the DataFrame API might make more sense for polars.