Draft strawman data frame "__dataframe__" interchange / data export protocol for discussion
Based on https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267, we are discussing a "protocol" method (potentially called `__dataframe__`) similar to `__array__` for data frame-like data. The consensus so far is that this protocol should not force conversions to a particular data frame memory model like pandas. Instead, it may provide access to its metadata and data in the form desired by the user of the `__dataframe__` protocol.
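To make the shape of the idea concrete, here is a minimal sketch. Everything besides `__dataframe__` itself (the `InterchangeFrame` object, `column_names`, `column`, etc.) is a hypothetical name — the interface of the returned object is exactly what is being discussed in this thread:

```python
# Hypothetical sketch of the protocol shape being discussed; none of these
# names are final. The producer returns an interchange object, and the
# consumer pulls metadata/data from it rather than forcing a conversion
# to a particular memory model.

class InterchangeFrame:
    """Toy interchange object wrapping a dict of column-name -> list."""

    def __init__(self, columns):
        self._columns = columns

    def column_names(self):
        # Returning an iterable rather than a concrete sequence.
        return iter(self._columns)

    def column(self, name):
        return self._columns[name]


class ToyProducer:
    """Stand-in for any data frame library's object."""

    def __init__(self, columns):
        self._columns = columns

    def __dataframe__(self):
        # The producer controls how its internal storage is exposed.
        return InterchangeFrame(self._columns)


def consume(obj):
    """A consumer that accepts anything implementing __dataframe__."""
    frame = obj.__dataframe__()
    return {name: frame.column(name) for name in frame.column_names()}


result = consume(ToyProducer({"a": [1, 2, 3], "b": ["x", "y", "z"]}))
```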
Questions:
- What arguments, if any, should be in the abstract `to_numpy` method (respectively, the `to_arrow` method)?
- Think about an abstract Column API for parametric types like categorical
- What kind of multiple-column-selection and row-selection APIs should be added?
The APIs proposed here are not intended to be to the exclusion of others -- more APIs can be proposed and added later.
Types: Modin does not enforce types or have its own system of types and does not place requirements on types. I think this is better.
Maybe not surprisingly I think types and schemas are good, and libraries like pandas being a bit "loosey goosey" about type metadata has IMHO caused a lot of problems over the last 10 years.
If the producer of a data frame can expose type metadata without requiring a potentially costly conversion, then it seems reasonable to me to permit this. If you know that a column contains integers, you might pass different arguments to `to_numpy` than if it contained strings.
As far as this interface having its "own system of types" -- what would be the alternative, to return the metadata of the internal data representation (e.g. a NumPy dtype)? That seems contrary to the goals of this project, which is to avoid exposing details of the data frame producer to the data frame consumer.
I think the requirements for `df[col_name].type` can also be relaxed so that a data frame producer can return a "type not indicated" value.
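As a sketch of what such a relaxed type accessor could look like — the `TYPE_NOT_INDICATED` sentinel and the `Column` class here are made-up names, not part of any agreed spec:

```python
# Sketch of a relaxed column-type accessor. TYPE_NOT_INDICATED is a
# hypothetical sentinel value; a producer that cannot cheaply determine
# (or chooses not to expose) type metadata returns it instead of guessing.

TYPE_NOT_INDICATED = object()


class Column:
    def __init__(self, values, type_=TYPE_NOT_INDICATED):
        self._values = values
        self._type = type_

    @property
    def type(self):
        # Either real type metadata or the "type not indicated" sentinel.
        return self._type


typed = Column([1, 2, 3], type_="int64")
untyped = Column([1, "a", None])  # producer declined to indicate a type
```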
Like above I'm interested to see what consumer projects (that don't want to depend on pandas, say) think about this.
Regarding conversion between different data frames, how would libraries convert between the types? It'd be nice to have an API similar to what `__array__` does. What I mean is to have the following work, with reasonable overhead, minimal memory copying, and assuming they all implement the proposed DataFrame API:
```python
pd.DataFrame(xarray.DataArray(np.ndarray(...)))
```
Not sure how easy/challenging it is to make it happen though.
@adrinjalali that is effectively what we are discussing here (starting from https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267) -- i.e. thinking about what the API of the object returned by a `__dataframe__` method is, so you would have:
```python
class XArray:
    ...

    def __dataframe__(self):
        # This class implements the API being discussed here
        return XArrayDataFrame(self)
```
The `__array__` method returns a `numpy.ndarray`, but we have to determine what kind of object `__dataframe__` returns and what behavior that object has.
EDIT: I just added some comments about this to the PR summary for clarity
With respect to an `__array__` analog for dataframes, doesn't that necessitate more dedicated methods like `__pandas_dataframe__`, `__modin_dataframe__`, etc.? The intent of such methods is to give objects control over producing a concrete result (a concrete ndarray, in `__array__`'s case). So the concrete dataframe library (say, `pandas.DataFrame`'s constructor) would need to check the object for a `__pandas_dataframe__` implementation and hand off control of the construction to the object with the `__pandas_dataframe__` method.
I'd be curious to know if there's value in a more generic __dataframe__ method, and if so what it would produce. Or do we think that consumers of this protocol fall in one of two camps:
- They definitely want a pandas DataFrame regardless of the input, so they call `pd.DataFrame(data)`.
- They want any dataframe-like object, and so they just require that the input be an object implementing this protocol.
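A toy sketch contrasting the two camps described above — all names are hypothetical, and a plain dict stands in for whatever interchange object `__dataframe__` would actually return:

```python
# Sketch only: the interchange object here is a plain dict for illustration.

class Producer:
    def __init__(self, data):
        self._data = data

    def __dataframe__(self):
        return self._data


def camp_one(obj):
    """Always materialize a concrete structure (standing in for pd.DataFrame)."""
    return dict(obj.__dataframe__())


def camp_two(obj):
    """Accept any object implementing the protocol; don't materialize anything."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("expected an object implementing __dataframe__")
    return obj  # keep working with it lazily through the protocol


p = Producer({"a": [1, 2]})
```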
> I'd be curious to know if there's value in a more generic `__dataframe__` method
This is exactly what is being proposed here and what I understood to be the spirit of the discussion in https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267. We need to determine what object is returned by __dataframe__ though -- what is in this PR is a proposal for that object's interface.
Both camps of users are served by this.
- If pandas encounters a foreign object that implements `__dataframe__`, then `pd.DataFrame(foreign_object)` will work
- Right, if a library wants to accept data frame-like data but not depend on pandas, then it can instead accept any object that has `__dataframe__`
I'm still not sure how we can have a unified object returned by `__dataframe__` in this scenario. To me, this proposal has two aspects, and both are equally important:
The first category is a unified interface for libraries to depend on, which can be used to extract information from the dataframe. Reading feature names, dtypes, etc, is in this category.
The second one is enabling the ecosystem to easily work with one another in an efficient way. If I compare `__dataframe__` to `__array__`, then I'd imagine that the details of how the object is converted to a specific dataframe should be handled in the `__dataframe__` method, and not in the [pandas, xarray, ...] constructor which reads the output of `__dataframe__`; the same way that numpy expects the output of `__array__` to be an ndarray. And to me this needs to be done by the object which implements `__dataframe__`, since it's the one that knows how to efficiently convert its internal data structures into a given dataframe. So I'd expect `__dataframe__` to either accept an argument such as pandas, xarray, etc., or to have specialized methods such as `__xarray_dataarray__`, `__pandas_dataframe__`, etc.
That said, for the dataframes to work nicely with one another, they don't have to implement any of these specialized `__*__` methods. They can start by implementing the interface in the first category, and then if such a dataframe is given to `pd.DataFrame`, the constructor will first check whether it has `__pandas_dataframe__` implemented, and if not, it will use the public methods to extract the information it needs to create a dataframe.
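A sketch of that fallback logic — all names are hypothetical, and plain dicts stand in for concrete dataframes:

```python
# Sketch of a constructor that prefers a specialized dunder (hypothetical
# name) and falls back to the generic interchange interface otherwise.

class GenericFrame:
    """Implements only the generic (first-category) interface."""

    def __init__(self, columns):
        self._columns = columns

    def column_names(self):
        return list(self._columns)

    def column(self, name):
        return self._columns[name]


class SpecializedFrame(GenericFrame):
    """Also knows how to convert itself efficiently for one consumer."""

    def __pandas_dataframe__(self):
        # In reality this would build a pandas.DataFrame with minimal copies;
        # here a tagged dict stands in for the result.
        return {"via": "specialized", **self._columns}


def construct(obj):
    """Stand-in for a pd.DataFrame-style constructor."""
    if hasattr(obj, "__pandas_dataframe__"):
        # Let the producer drive the conversion.
        return obj.__pandas_dataframe__()
    # Fallback: extract what we need through the public methods.
    return {name: obj.column(name) for name in obj.column_names()}
```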
I hope this makes what I meant before a bit clearer.
I think we should be careful about going down the route of having a heavily overloaded `__getitem__` in this interface.
+1
One thing that occurred to me is that the interface is completely column-oriented. But if you have a large file or database with lots of rows, and you read it in row by row, you won't have complete column vectors until you have read everything in, and then you might have some issues with memory.
So my question is whether the interface should define a protocol for creating a DataFrame from row-oriented data.
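One way a row-oriented path could avoid materializing full column vectors is to emit fixed-size column chunks as rows stream in. A sketch, with all names hypothetical:

```python
# Sketch: accumulate streamed rows into bounded column batches so that
# complete column vectors never need to exist in memory at once.

def rows_to_column_chunks(rows, names, chunk_size):
    """Yield {name: list} column batches of at most chunk_size rows."""
    chunk = {name: [] for name in names}
    count = 0
    for row in rows:
        for name, value in zip(names, row):
            chunk[name].append(value)
        count += 1
        if count == chunk_size:
            yield chunk
            chunk = {name: [] for name in names}
            count = 0
    if count:  # flush any trailing partial batch
        yield chunk


chunks = list(rows_to_column_chunks(
    [(1, "a"), (2, "b"), (3, "c")], names=("x", "y"), chunk_size=2))
```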
I made another pass on this per feedback here. I removed all the dunder methods in the interest of being as conservative / explicit as possible. Take a look
I relaxed hashability of column names and changed `column_names` to return an `Iterable`. PTAL
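A sketch of what that relaxed contract could permit (hypothetical names): column names need not be hashable, and `column_names` may return any iterable rather than a concrete sequence:

```python
# Sketch: names may be unhashable (e.g. lists), so the frame stores
# (name, values) pairs instead of a dict keyed by name.

class Frame:
    def __init__(self, pairs):
        self._pairs = pairs

    def column_names(self):
        # Any iterable satisfies the relaxed contract; a generator works.
        return (name for name, _ in self._pairs)


f = Frame([(["multi", "level"], [1, 2]), ("plain", [3, 4])])
names = list(f.column_names())
```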
If someone would like write access on this repository to help lead this effort please let me know. I'm juggling a few too many projects so need to step away from this for a while