
Draft strawman data frame "__dataframe__" interchange / data export protocol for discussion

Open wesm opened this issue 6 years ago • 11 comments

Based on https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267, we are discussing a "protocol" method (potentially called __dataframe__) similar to __array__ for data frame-like data. The consensus so far is that this protocol should not force conversions to a particular data frame memory model like pandas. Instead, it may provide access to its metadata and data in the form desired by the user of the __dataframe__ protocol.

Questions:

  • What arguments, if any, should the abstract to_numpy method accept (and likewise the to_arrow method)?
  • Think about an abstract Column API for parametric types like categorical
  • What kind of multiple-column-selection and row-selection APIs should be added

The APIs proposed here are not intended to be to the exclusion of others -- more APIs can be proposed and added later.
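To make the shape of such an interchange object concrete, here is a minimal illustrative sketch; all names (`InterchangeDataFrame`, `InterchangeColumn`, `column_names`, `column`, `to_numpy`, the `type` attribute) are hypothetical placeholders for the API under discussion, not an agreed design:

```python
import numpy as np


class InterchangeColumn:
    """Hypothetical per-column view exposed by the interchange object."""

    def __init__(self, name, values, type_name=None):
        self.name = name
        self._values = values
        # Type metadata may be absent ("type not indicated").
        self.type = type_name

    def to_numpy(self):
        # A real producer would avoid copies where its memory layout allows.
        return np.asarray(self._values)


class InterchangeDataFrame:
    """Hypothetical object a producer's __dataframe__ method might return."""

    def __init__(self, columns):
        self._columns = columns  # dict: column name -> InterchangeColumn

    def column_names(self):
        return iter(self._columns)

    def column(self, name):
        return self._columns[name]
```

A consumer would then iterate `column_names()` and materialize only the columns it needs, in whatever representation it wants.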

wesm avatar Mar 14 '20 00:03 wesm

Types: Modin does not enforce types or have its own system of types and does not place requirements on types. I think this is better.

Maybe not surprisingly I think types and schemas are good, and libraries like pandas being a bit "loosey goosey" about type metadata has IMHO caused a lot of problems over the last 10 years.

If the producer of a data frame can expose type metadata without requiring a potentially costly conversion, then it seems reasonable to me to permit this. If you know that a column contains integers you might pass different arguments to to_numpy than if it were strings.

As far as this interface having its "own system of types" -- what would be the alternative, to return the metadata of the internal data representation (e.g. a NumPy dtype)? That seems contrary to the goals of this project, which is to avoid exposing details of the data frame producer to the data frame consumer.

I think the requirements for df[col_name].type can also be relaxed so that a data frame producer can return a "type not indicated" value.

Like above I'm interested to see what consumer projects (that don't want to depend on pandas, say) think about this.

wesm avatar Mar 15 '20 22:03 wesm

Regarding conversion between different data frames, how would libraries convert between the types? It'd be nice to have an API similar to what __array__ does. What I mean is to have this work, with reasonable overhead, minimal memory copy, and assuming they all implement the proposed DataFrame API:

pd.DataFrame(xarray.DataArray(np.ndarray(...)))

Not sure how easy/challenging it is to make it happen though.

adrinjalali avatar Mar 16 '20 16:03 adrinjalali

@adrinjalali that is effectively what we are discussing here (starting from https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267) -- i.e. thinking about what is the API of the object returned by a __dataframe__ method, so you would have

```python
class XArray:
    ...

    def __dataframe__(self):
        # This class implements the API being discussed here
        return XArrayDataFrame(self)
```

The __array__ method returns a numpy.ndarray, but we still need to determine what kind of object __dataframe__ returns and what behavior that object has.
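For comparison, this is how `__array__` dispatch works today, shown next to a hypothetical helper (`as_interchange_frame` is a made-up name) that would play the analogous role for `__dataframe__`:

```python
import numpy as np


class WrapsArray:
    """Object that opts into the existing __array__ protocol."""

    def __init__(self, data):
        self._data = data

    def __array__(self, dtype=None):
        return np.asarray(self._data, dtype=dtype)


def as_interchange_frame(obj):
    """Hypothetical analog of np.asarray for data frames:
    dispatch to __dataframe__ if the object provides it."""
    if hasattr(obj, "__dataframe__"):
        return obj.__dataframe__()
    raise TypeError("object does not support the __dataframe__ protocol")


arr = np.asarray(WrapsArray([1, 2, 3]))  # np.asarray dispatches to __array__
```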

EDIT: I just added some comments about this to the PR summary for clarity

wesm avatar Mar 16 '20 16:03 wesm

With respect to an __array__ analog for dataframes, doesn't that necessitate more dedicated methods like __pandas_dataframe__, __modin_dataframe__, etc.? The intent of these methods is to give objects control over producing a concrete ndarray. So the concrete dataframe (say, pandas.DataFrame's constructor) would need to check the object for a __pandas_dataframe__ implementation and hand off control of the construction to the object with the __pandas_dataframe__ method.

I'd be curious to know if there's value in a more generic __dataframe__ method, and if so what it would produce. Or do we think that consumers of this protocol fall in one of two camps:

  1. They definitely want a pandas DataFrame regardless of the input, so they call pd.DataFrame(data).
  2. They want any dataframe-like object, and so they just require that the input be an object implementing this protocol.

TomAugspurger avatar Mar 17 '20 20:03 TomAugspurger

I'd be curious to know if there's value in a more generic __dataframe__ method

This is exactly what is being proposed here and what I understood to be the spirit of the discussion in https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267. We need to determine what object is returned by __dataframe__ though -- what is in this PR is a proposal for that object's interface.

Both camps of users are served by this.

  1. If pandas encounters a foreign object that implements __dataframe__, then pd.DataFrame(foreign_object) will work

  2. Right, if a library wants to accept data frame-like data but not depend on pandas, then it can instead accept any object that has __dataframe__
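A toy illustration of camp 2, with every name hypothetical: the consumer depends only on the protocol, whereas a camp-1 consumer would simply call pd.DataFrame(obj) up front:

```python
class ToyFrame:
    """Minimal stand-in producer, for illustration only."""

    def __init__(self, columns):
        self._columns = columns  # dict: column name -> list of values

    def __dataframe__(self):
        # A real producer would return a dedicated interchange view object.
        return self

    def column_names(self):
        return iter(self._columns)


def feature_names(obj):
    """Camp 2: accept anything implementing __dataframe__, no pandas needed."""
    return list(obj.__dataframe__().column_names())
```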

wesm avatar Mar 17 '20 23:03 wesm

I'm still not sure how we can have a unified object returned by __dataframe__ in this scenario. To me, this proposal has two aspects to it, and both are equally important:

The first category is a unified interface for libraries to depend on, which can be used to extract information from the dataframe. Reading feature names, dtypes, etc, is in this category.

The second one is enabling the ecosystem to easily work with one another in an efficient way. If I compare __dataframe__ to __array__, then I'd imagine that the details of how the object is converted to a specific dataframe should live in the __dataframe__ method, not in the [pandas, xarray, ...] constructor which reads the output of __dataframe__; the same way that numpy expects the output of __array__ to be an ndarray. And to me this needs to be done by the object which implements __dataframe__, since it's the one that knows how to efficiently convert its internal data structures into a given dataframe. So I'd expect __dataframe__ to either accept an argument such as pandas, xarray, etc., or to have specialized methods such as __xarray_dataarray__, __pandas_dataframe__, etc.

That said, for the dataframes to work nicely with one another, they don't have to implement any of these specialized __*__ methods. They can start by implementing the interface in the first category, and then if such a dataframe is given to pd.DataFrame, the constructor will first check whether it has __pandas_dataframe__ implemented, and if not, it will use the public methods to extract the information it needs to create a dataframe.
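That fallback order could look roughly like this inside a consumer's constructor; the dunder names are the hypothetical ones from this thread, and the toy producers exist only to show the dispatch:

```python
def construct_pandas_like(obj):
    """Illustrative constructor dispatch: prefer a specialized hook,
    fall back to the generic interchange interface."""
    specialized = getattr(obj, "__pandas_dataframe__", None)
    if specialized is not None:
        # The producer knows how to build this concrete frame efficiently.
        return specialized()
    generic = getattr(obj, "__dataframe__", None)
    if generic is not None:
        frame = generic()
        # Rebuild column by column via the generic interface.
        return {name: frame.column(name) for name in frame.column_names()}
    raise TypeError("not a data frame-like object")


class SpecializedProducer:
    def __pandas_dataframe__(self):
        return "built by specialized hook"


class GenericProducer:
    def __dataframe__(self):
        return self

    def column_names(self):
        return iter(["x"])

    def column(self, name):
        return [1, 2, 3]
```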

I hope this makes what I meant before a bit clearer.

adrinjalali avatar Mar 18 '20 12:03 adrinjalali

I think we should be careful about going down the route of having a heavily-overloaded getitem in this interface.

+1

GaelVaroquaux avatar Mar 18 '20 17:03 GaelVaroquaux

One thing that occurred to me is that the interface is completely column-oriented. But if you have a large file or database with lots of rows, and you read it in row by row, you won't have complete column vectors until you have read everything in, and then you might have some issues with memory.

So my question is whether the interface should define a protocol for creating a DataFrame from row-oriented data.
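One way to bridge row-oriented sources (purely a sketch, not part of the proposal) is to batch rows into bounded column chunks, so full column vectors never need to exist in memory at once:

```python
def iter_column_chunks(rows, column_names, chunk_size):
    """Turn a row iterator into column-oriented chunks of bounded size.
    Purely illustrative of a row-to-column batching strategy."""
    chunk = {name: [] for name in column_names}
    count = 0
    for row in rows:
        for name, value in zip(column_names, row):
            chunk[name].append(value)
        count += 1
        if count == chunk_size:
            yield chunk
            chunk = {name: [] for name in column_names}
            count = 0
    if count:
        yield chunk  # final partial chunk
```

A row-oriented producer could expose something like this behind the protocol, letting consumers assemble or stream columns chunk by chunk.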

Dr-Irv avatar Mar 19 '20 18:03 Dr-Irv

I made another pass on this per feedback here. I removed all the dunder methods in the interest of being as conservative / explicit as possible. Take a look

wesm avatar Apr 08 '20 22:04 wesm

I relaxed hashability of column names and changed column_names to return Iterable. PTAL

wesm avatar Apr 09 '20 15:04 wesm

If someone would like write access on this repository to help lead this effort, please let me know. I'm juggling a few too many projects, so I need to step away from this for a while.

wesm avatar Apr 10 '20 15:04 wesm