Enabling the sub-classing of core data types
Description
Currently, the following test will fail on the last line:
import polars as pl

def test_preservation_of_subclasses() -> None:
    class SubClassedDataFrame(pl.DataFrame):
        pass

    df = SubClassedDataFrame({"column_1": [1, 2, 3]})
    assert isinstance(df, pl.DataFrame)  # OK, is super-class
    assert isinstance(df, SubClassedDataFrame)  # OK

    expanded_df = df.with_column(pl.lit(2).alias("column_2"))
    assert isinstance(expanded_df, pl.DataFrame)  # OK
    assert isinstance(expanded_df, SubClassedDataFrame)  # Fail!
That is, pl.DataFrame does support sub-classing, but the sub-class is "lost" as soon as you invoke any method that produces a new object. The idea is that any method that yields a new data frame should use the constructor of its own class.
Motivation
The idea is that if Polars has first-class support for sub-classing, it will allow library users to extend the functionality of the library in an object-oriented manner. Such code might end up looking something like this:
from __future__ import annotations

import polars
import pydantic

class ExtendedDataFrame(polars.DataFrame):
    def __init__(self, *args, **kwargs):
        # Code for supporting other data types here...
        super().__init__(*args, **kwargs)

    def validate(self, schema: pydantic.BaseModel) -> ExtendedDataFrame:
        # Custom validation logic based on the pydantic model here...
        ...

    def clean(self) -> ExtendedDataFrame:
        # Custom data cleaning logic here...
        ...
On the one hand this would enable library authors to extend the functionality of Polars with domain-specific functionality, while on the other hand product developers could create DataFrame structures for representing specific business logic, such as:
class Products(polars.DataFrame):
    def for_sale(self) -> Products:
        return self.filter(pl.col("status") == "for_sale")

    # And so on...
It would also be a real distinguishing factor for Polars, adding value relative to the existing solutions out there. I know that it would provide a huge benefit for the type of code I write!
Changes required
Now, this is easier said than done, for sure. There are several complications in writing a generic library that supports arbitrary sub-classing of its core data structures. I will try to outline some of the solutions that need to be applied in order to support this use case.
1. Dynamic class references
The main idea is that any hard-coded references to DataFrame/DataFrame.__init__/DataFrame.__new__ need to be replaced with dynamic class references. Take the following snippet as an example:
@classmethod
def _from_dicts(cls, data: Sequence[Dict[str, Any]]) -> "DataFrame":
    pydf = PyDataFrame.read_dicts(data)
    return DataFrame._from_pydf(pydf)
The problem here is the hard-coded reference to DataFrame, which needs to be replaced with a dynamic reference to cls instead. That way the type is preserved.
@classmethod
def _from_dicts(cls, data: Sequence[Dict[str, Any]]) -> "DataFrame":
    pydf = PyDataFrame.read_dicts(data)
    return cls._from_pydf(pydf)
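As a standalone illustration of why the cls-based version preserves sub-classes, here is a minimal sketch independent of Polars:

class Base:
    @classmethod
    def create(cls) -> "Base":
        # cls is whatever class the method was invoked on,
        # so sub-classes construct instances of themselves
        return cls()

class Sub(Base):
    pass

assert isinstance(Sub.create(), Sub)  # passes with the dynamic reference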
2. Adapting constructor functions
Most construction of new DataFrame objects is done through the wrap_df utility function, defined as follows:
def wrap_df(df: "PyDataFrame") -> "DataFrame":
    return DataFrame._from_pydf(df)
Here we also have a hard-coded reference to DataFrame. An example invocation is in DataFrame.__mul__:
def __mul__(
    self, other: Union["DataFrame", "pli.Series", int, float, bool]
) -> "DataFrame":
    if isinstance(other, DataFrame):
        return wrap_df(self._df.mul_df(other._df))
The problem here is that wrap_df accepts a PyDataFrame object rather than the DataFrame object, so the type information is lost. I guess PyDataFrame is a class implemented in Rust and made available through the Rust bindings? A few different approaches come to mind:
1. Add an additional cls parameter to wrap_df() with default value DataFrame. Pass in the correct class in DataFrame methods where we would like to preserve the type.
2. Implement DataFrame._wrap_df(), a classmethod implemented much like wrap_df, only that it uses cls to construct a new dataframe instance instead.
3. It also looks like we can invoke self._from_pydf() directly instead of wrap_df() in most methods of DataFrame. Any reasons not to do so?
4. Add some (optional?) reference on the PyDataFrame object to the class which is supposed to wrap it. Here my expertise is lacking to evaluate the feasibility, since I'm not that familiar with Rust.
I'm currently leaning toward option 3.
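To make option 3 concrete, __mul__ could construct through the instance's own class; a sketch, assuming _from_pydf is a classmethod as in the snippets above:

def __mul__(
    self, other: Union["DataFrame", "pli.Series", int, float, bool]
) -> "DataFrame":
    if isinstance(other, DataFrame):
        # A classmethod called on an instance dispatches on type(self),
        # so sub-classes yield sub-class results
        return self._from_pydf(self._df.mul_df(other._df))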
3. Change type annotations (optional)
To illustrate this change, take the following example code:
class Foo:
    @classmethod
    def bar(cls) -> "Foo":
        return cls()

class SubFoo(Foo):
    pass

reveal_type(Foo.bar())  # Foo (correct)
reveal_type(SubFoo.bar())  # Foo (incorrect)
Here the problem is that the return type annotation of Foo.bar is hard-coded to Foo, although it is really the type of cls. The solution is to type it as follows:
from typing import Type, TypeVar

FooType = TypeVar("FooType", bound="Foo")

class Foo:
    @classmethod
    def bar(cls: Type[FooType]) -> FooType:
        return cls()

class SubFoo(Foo):
    pass

reveal_type(Foo.bar())  # Foo (correct)
reveal_type(SubFoo.bar())  # SubFoo (correct)
For Polars this would mean annotating most of the DataFrame methods with DataFrameType = TypeVar("DataFrameType", bound="DataFrame") in order for sub-classes to get correct type inference from most type checkers. This will become even easier once PEP 673 -- Self Type is available for use as well.
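Applied to Polars, the _from_dicts classmethod from earlier would then be annotated roughly like this (a sketch, reusing PyDataFrame from the snippets above):

from typing import Any, Dict, Sequence, Type, TypeVar

DataFrameType = TypeVar("DataFrameType", bound="DataFrame")

class DataFrame:
    @classmethod
    def _from_dicts(
        cls: Type[DataFrameType], data: Sequence[Dict[str, Any]]
    ) -> DataFrameType:
        pydf = PyDataFrame.read_dicts(data)
        # The annotation now matches the runtime behaviour: the caller's class
        return cls._from_pydf(pydf)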
4. Preserve sub-classes after LazyFrame round-trips
One of the main problems with Polars dataframes is how they are often cast between DataFrame and LazyFrame (and back). We would ideally like .lazy().collect() to return the same type as the original object. In other words, we would have to store the original wrapper class on LazyFrame and then use it when going back with .collect() and so on. The same goes for other classes such as GBSelection.
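A sketch of what that could look like; the _dataframe_class attribute and the internal calls are hypothetical placeholders:

class LazyFrame:
    # Hypothetical: the DataFrame (sub)class that created this LazyFrame
    _dataframe_class: Type["DataFrame"] = DataFrame

    def collect(self) -> "DataFrame":
        # Wrap the result in the remembered class, not a hard-coded DataFrame
        return self._dataframe_class._from_pydf(self._ldf.collect())

class DataFrame:
    def lazy(self) -> "LazyFrame":
        ldf = LazyFrame._from_pyldf(self._df.lazy())  # placeholder internals
        ldf._dataframe_class = type(self)  # remember the original wrapper class
        return ldf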
5. Allow users to sub-class pl.DataFrame, pl.LazyFrame, and pl.Series and connect them together
Perhaps this is starting to become a bit too complex, and it is not strictly necessary, but I will write it down here anyway. You could imagine users being able to specify something like this:
class MySeries(pl.Series):
    ...

class MyDataFrame(pl.DataFrame):
    SERIES_CLASS = MySeries
    ...
Any methods on pl.DataFrame that construct series would then use self.SERIES_CLASS instead of hard-coded references to pl.Series, for instance as sketched below. But I will write up a separate issue to discuss it if we get there!
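A sketch of a series-producing method routed through that hook; to_series and the internal calls are placeholders for whatever Polars actually uses:

class DataFrame:
    SERIES_CLASS = Series  # overridable hook for sub-classes

    def to_series(self, index: int = 0) -> "Series":
        # Construct through the hook so that MyDataFrame yields MySeries
        return self.SERIES_CLASS._from_pyseries(self._df.select_at_idx(index))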
Conclusion
As you can see, there are quite a few things to keep in mind, but I think it would offer some really nice benefits to library users. One important thing to note is that not all five of the aforementioned tasks are required; I/we could implement any subset of them.
If this use case makes sense to you, I could make an attempt at solving tasks 1, 2, and 3 :relaxed: Tasks 4 and 5 might come later in separate PRs.
I also totally understand that you would have to see what a PR would look like in practice in order to make the final decision.
I have created a PR for tasks 1-3 as outlined above here: https://github.com/pola-rs/polars/pull/2859
Any updates on this?
Is there any workaround for this now?
We have decided not to support subclassing our core data types.
You can work around this in a few ways.
- Extend the core classes with your own methods using custom namespaces (see the sketch after the example below).
- Wrap the core classes with your own class using composition, e.g.:
import polars as pl
from typing import Any

class MyDF:
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self._df = pl.DataFrame(*args, **kwargs)

    def __getattr__(self, name: str) -> Any:
        """Redirect any calls to the underlying Polars DataFrame."""
        return getattr(self._df, name)

    def select(self, column: str) -> None:
        """Custom select function."""
        ...
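For the first workaround, Polars provides decorators for registering custom namespaces on the core classes. A minimal sketch (the "shop" namespace name and the example data are illustrative):

import polars as pl

@pl.api.register_dataframe_namespace("shop")
class ShopNamespace:
    def __init__(self, df: pl.DataFrame) -> None:
        self._df = df

    def for_sale(self) -> pl.DataFrame:
        return self._df.filter(pl.col("status") == "for_sale")

df = pl.DataFrame({"status": ["for_sale", "sold"]})
for_sale_df = df.shop.for_sale()  # custom method, no sub-classing needed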