duckdb icon indicating copy to clipboard operation
duckdb copied to clipboard

[Python] Add support for Protocols

Open Tishj opened this issue 3 years ago • 11 comments

This PR addresses #4341

It adds support for loading local python files (thanks to @Mause), though they aren't used yet, because the packages we have attempted to use in place of Pandas and Arrow structures don't really respect the protocol, or the internal methods of Pandas/Arrow don't support the use of a protocol (using direct isinstance checks for example)

I think it's worth keeping the architectural changes that allow the use of these protocols in the future though, so I've opted to keep these changes in this PR.

So instead of the protocols, this takes the first step towards increasing compatibility with other packages that subclass arrow/pandas, by using isinstance checks instead of the type name comparisons we did previously.

This also adds support to the PythonImportCache for non-required modules

Changes made by this PR are divided into these categories:

  • Protocol support - changing the package to allow for adding multiple modules (to allow bundling the protocols.py file as part of the duckdb package)
  • Instance checking - instead of comparing the stringified name of the type, check if the type is an instance of the class, allows for compatibility with subclasses, and easy transition to compatibility with classes that adhere to a Protocol (if applicable)
  • Supporting non-required modules with PythonImportCache (such as Arrow and Pandas) - a hook IsRequired is added to mark a module as optional, in addition IsInstance is added to the PythonImportCacheItem because py::isinstance segfaults when a handle is nullptr, which is what is returned when an attribute of an optional module is requested

Tishj avatar Aug 18 '22 14:08 Tishj

Hey @Tishj many thanks for tackling this issue, now that I see the start of this solution, it makes me think this looks like an overkill. I wonder what are the advantages (to justify the code complexity) of adding this protocol layer when compared to a bunch of py::hasatr checks? e.g.,

bool IsThisArrowLike(py:object obj){
return py::hasatr(obj,"...") && ...;
}

pdet avatar Aug 18 '22 14:08 pdet

I think there is a benefit to having these protocols in native python, because contributors that might want to add extra support will probably be python developers, so not having to translate that into c++ is probably a lower threshold.

Apart from that I think there are checks that Protocol accomplishes that we can't with just hasattr, namely method prototypes; We can check that an object implements a method with a given name, but we can't check that it takes n amount of arguments, which Protocol can. (maybe argument types and/or return type is also supported by Protocol, but I'm not sure about that)

Tishj avatar Aug 18 '22 14:08 Tishj

I think there is a benefit to having these protocols in native python, because contributors that might want to add extra support will probably be python developers, so not having to translate that into c++ is probably a lower threshold.

Apart from that I think there are checks that Protocol accomplishes that we can't with just hasattr, namely method prototypes; We can check that an object implements a method with a given name, but we can't check that it takes n amount of arguments, which Protocol can. (maybe argument types and/or return type is also supported by Protocol, but I'm not sure about that)

Are there any concrete examples of libraries that implement the same methods as Arrow and/or Pandas (I guess these are the two main use-cases) that would crash on the method call?

My guess is that this will should throw an error, and in the end, would be the user trying to feed an object of the wrong type to us. If that's the case I would say that's acceptable behavior.

I don't think the Python-Dev support is a strong argument against py::hasattr.

pdet avatar Aug 18 '22 14:08 pdet

What about Dask and Modin for Pandas, and Polars for Arrow?

Alex-Monahan avatar Aug 18 '22 14:08 Alex-Monahan

Hey @Tishj many thanks for tackling this issue, now that I see the start of this solution, it makes me think this looks like an overkill. I wonder what are the advantages (to justify the code complexity) of adding this protocol layer when compared to a bunch of py::hasatr checks? e.g.,

bool IsThisArrowLike(py:object obj){
return py::hasatr(obj,"...") && ...;
}

I think you're also missing the fact that protocols document what's actually expected by duckdb as the input types - not just for DuckDB.org, but they also allow type checking with mypy/pyright etc

Mause avatar Aug 18 '22 15:08 Mause

What about Dask and Modin for Pandas, and Polars for Arrow?

I hadn't heard of Modin before, I'll see if that might be a candidate, but Dask has some issues One example is we use the array method on a Series, and Dask doesn't have it There's an issue pending for this here: https://github.com/dask/dask/issues/5001

>>> pd_df[0].array
<PandasArray>
[5, 4, 1]
Length: 3, dtype: int64
>>> dd_df[0].array
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Series' object has no attribute 'array'

Update: Modin has this same limitation :/

Tishj avatar Aug 18 '22 15:08 Tishj

@Alex-Monahan Can you give an example of a datastructure of Polars that matches that of Pyarrow's? I'm looking at polars.DataFrame and I can't find a lot of similarity between arrow's Table format, or others like InMemoryDataset, FileSystemDataset, RecordBatchReader, Scanner

If you don't have an example handy, don't go through a lot of effort, as I doubt it would work even if they do share an interface, because the methods of Arrow that we call only accept directly inherited subclasses

Tishj avatar Aug 19 '22 13:08 Tishj

@Alex-Monahan Can you give an example of a datastructure of Polars that matches that of Pyarrow's? I'm looking at polars.DataFrame and I can't find a lot of similarity between arrow's Table format, or others like InMemoryDataset, FileSystemDataset, RecordBatchReader, Scanner

If you don't have an example handy, don't go through a lot of effort, as I doubt it would work even if they do share an interface, because the methods of Arrow that we call only accept directly inherited subclasses

Hmm! I am not an expert. Maybe their to_arrow method will show how similar or different they are?

Alex-Monahan avatar Aug 19 '22 15:08 Alex-Monahan

Are you still looking at this @Tishj? It would still be valuable I think, even without the protocol stuff

Mause avatar Sep 13 '22 08:09 Mause

Are you still looking at this @Tishj? It would still be valuable I think, even without the protocol stuff

I had kind of left it, since it's still unclear if the main part of the PR is useful to us. But maybe I can strip some of the protocol related stuff out

Tishj avatar Sep 13 '22 10:09 Tishj

The main part of the PR, being the improved type checking? That is definitely still worth it

Mause avatar Sep 17 '22 14:09 Mause

@Mause @pdet can this be merged or does more need to happen here?

Mytherin avatar Sep 28 '22 07:09 Mytherin

LGTM! I think it can be merged, we dropped the actual use of protocols but the Instance checking and thePythonImportCache are great additions.

Thanks for the great work @Tishj !

pdet avatar Oct 10 '22 11:10 pdet

Thanks!

Mytherin avatar Oct 10 '22 11:10 Mytherin