enh: `Implementation` for plugins
It might be worth thinking about how plugins / extensions can interact with Implementation
Implementation is used in some places, such as:
- scan_parquet
- from_dict
- ...
If someone has, say, narwhals-daft installed, I think they should be able to do
df: nw.LazyFrame # `df` is a narwhals LazyFrame backed by Daft
nw.scan_parquet(file, backend=df.implementation)
and
nw.scan_parquet(file, backend='narwhals-daft')
The crucial parts are:
https://github.com/narwhals-dev/narwhals/blob/8bfeb016bbaaddab4e0d4c98e0a006f5d6b465ed/narwhals/functions.py#L858-L859
https://github.com/narwhals-dev/narwhals/blob/8bfeb016bbaaddab4e0d4c98e0a006f5d6b465ed/narwhals/functions.py#L891-L898
We may need to transition Implementation away from being an Enum to something which can be extended, with a mechanism through which plugins can add their own member. For example, when we discover plugins, we do
In [1]: from importlib.metadata import entry_points
In [2]: entry_points(group='narwhals.plugins')
Out[2]: (EntryPoint(name='narwhals-daft', value='narwhals_daft', group='narwhals.plugins'),)
The name of the entrypoint (here, narwhals-daft) can serve as the name for the Implementation (Implementation.NARWHALS_DAFT), and we impose that plugin names must be unique (i.e. you can't install two plugins with the same name for the 'narwhals.plugins' entrypoint)
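As a rough illustration of the naming idea, the sketch below maps entry-point names to enum-style member names and rejects duplicates. The helpers `to_member_name` and `collect_plugins` are hypothetical and not part of Narwhals; the entry point is built by hand so the example runs without any plugin installed.

```python
from __future__ import annotations

from importlib.metadata import EntryPoint


def to_member_name(entry_point_name: str) -> str:
    """Hypothetical helper: 'narwhals-daft' -> 'NARWHALS_DAFT'."""
    return entry_point_name.replace("-", "_").upper()


def collect_plugins(eps: list[EntryPoint]) -> dict[str, EntryPoint]:
    """Hypothetical helper: index entry points by member name, rejecting duplicates."""
    plugins: dict[str, EntryPoint] = {}
    for ep in eps:
        key = to_member_name(ep.name)
        if key in plugins:
            msg = f"Duplicate plugin name: {ep.name!r}"
            raise ValueError(msg)
        plugins[key] = ep
    return plugins


# Hand-built entry point mirroring the one shown above; nothing needs to be installed.
daft = EntryPoint(name="narwhals-daft", value="narwhals_daft", group="narwhals.plugins")
plugins = collect_plugins([daft])
print(list(plugins))
```

In a real setup the list would presumably come from `entry_points(group='narwhals.plugins')` rather than being constructed manually.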
@dangotbanned @FBruzzesi as you've both been looking at Implementation recently
I don't think we need this.
Implementation makes sense as an Enum because it allows branching on specific backends when working downstream.
The idea I had in (#2786) would mean passing in any object that can provide __narwhals_namespace__ - but with whatever constraints the call has.
E.g. if we want a scan_parquet, then calling this method is much simpler and works within the type system
namespace: DaftNamespace = obj.__narwhals_namespace__(...)
namespace.scan_parquet(...)
We can define all of that without explicitly naming any plugins or changing Implementation
sure, and what do you propose doing to:
implementation = Implementation.from_backend(backend)
native_namespace = implementation.to_native_namespace()
when backend is say 'narwhals-daft'? And how would
df: nw.LazyFrame # `df` is a narwhals LazyFrame backed by Daft
nw.scan_parquet(file, backend=df.implementation)
work?
I'm just answering this one for now as it is simpler 😄
BaseFrame already satisfies the protocol
https://github.com/narwhals-dev/narwhals/blob/82052e24cb27aee51cd6cc357ed2d1f680b1807c/narwhals/dataframe.py#L107-L115
So we could just write:
df: nw.LazyFrame # `df` is a narwhals LazyFrame backed by Daft
nw.scan_parquet(file, backend=df)
Which then internally does:
def scan_parquet(
    source: str, *, backend: IntoBackend[Backend], **kwargs: Any  # `IntoBackend` can be extended
) -> LazyFrame[Any]:
    implementation = Implementation.from_backend(backend)
    if implementation is not Implementation.UNKNOWN:
        ns = Version.MAIN.namespace.from_backend(implementation).compliant
    elif supports_narwhals_namespace(backend):
        ns = backend.__narwhals_namespace__()
    else:
        raise TypeError
    return ns.scan_parquet(source, **kwargs)
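The snippet relies on a `supports_narwhals_namespace` check without spelling it out. One way it could look is a runtime-checkable protocol; this is only a sketch, and `SupportsNarwhalsNamespace`, `FakeNamespace`, and `FakeFrame` are made-up names for illustration, not Narwhals internals.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsNarwhalsNamespace(Protocol):
    """Assumed shape: anything exposing `__narwhals_namespace__`."""

    def __narwhals_namespace__(self) -> Any: ...


def supports_narwhals_namespace(obj: Any) -> bool:
    # isinstance() on a runtime-checkable protocol only checks that the method exists
    return isinstance(obj, SupportsNarwhalsNamespace)


class FakeNamespace:
    """Stand-in for a plugin's compliant namespace."""

    def scan_parquet(self, source: str, **kwargs: Any) -> str:
        return f"scanned {source}"


class FakeFrame:
    """Stand-in for a frame backed by a plugin."""

    def __narwhals_namespace__(self) -> FakeNamespace:
        return FakeNamespace()


frame = FakeFrame()
print(supports_narwhals_namespace(frame))     # True
print(supports_narwhals_namespace("polars"))  # False
print(frame.__narwhals_namespace__().scan_parquet("data.parquet"))
```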
Sure, we can write something else, but
nw.scan_parquet(file, backend=df.implementation)
would not be supported, right?
I'm okay with this.
If that has to be supported even though there is an alternative, then two other ideas are:
- Resolve the Implementation.UNKNOWN inside scan_parquet by interfacing with EntryPoints
  i. I'm guessing the equivalent of this would need to happen anyway, even if Implementation.NARWHALS_DAFT worked
- Following (#3016), define behavior inside the descriptor to return something instead of Implementation.UNKNOWN on df.implementation when an extension is installed
sure, makes sense, thanks
i'm also ok with not supporting that if alternatives are available (we can always leave a note about this in the docstring), at least for now
scan_parquet(file, backend='narwhals-daft') which then gets resolved directly to a compliant namespace (bypassing Implementation completely) seems fine too
Yep, couldn't have said it better myself 😄
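For the string case, resolving directly from the entry point name might look roughly like this. `resolve_plugin` is a hypothetical helper, and the entry point below deliberately points at the stdlib `json` module as a stand-in so the sketch runs without any plugin installed.

```python
from __future__ import annotations

from importlib.metadata import EntryPoint
from types import ModuleType
from typing import Iterable


def resolve_plugin(backend: str, eps: Iterable[EntryPoint]) -> ModuleType:
    """Hypothetical helper: load the plugin whose entry-point name matches `backend`."""
    for ep in eps:
        if ep.name == backend:
            # EntryPoint.load() imports and returns the object named by `value`
            return ep.load()
    msg = f"No plugin registered under {backend!r}"
    raise ValueError(msg)


# Stand-in entry point: `json` plays the role of a plugin module here.
demo = EntryPoint(name="demo-plugin", value="json", group="narwhals.plugins")
module = resolve_plugin("demo-plugin", [demo])
print(module.__name__)  # json
```

With a real plugin installed, the iterable would presumably be `entry_points(group='narwhals.plugins')`, and Implementation would never be consulted for these names.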
Also I think Implementation.UNKNOWN might be a nice invariant to have for if someone explicitly doesn't want to support extensions.
Not sure how common that may be - but it's fairly simple with the current Enum:
import narwhals as nw

def what(backend: nw.Implementation):
    if backend is not nw.Implementation.UNKNOWN:
        ...
    if backend is nw.Implementation.UNKNOWN:
        ...
Or with new sugar:
def what(backend: nw.Implementation):
    if backend.is_known():
        ...
    if backend.is_unknown():
        ...
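For reference, that kind of sugar is cheap to express on an Enum. The class below is a toy stand-in rather than `nw.Implementation`, and the two methods are only an assumption about what such helpers could look like.

```python
from enum import Enum


class Implementation(Enum):
    """Toy stand-in with two members; not the real narwhals enum."""

    PANDAS = "pandas"
    UNKNOWN = "unknown"

    def is_unknown(self) -> bool:
        return self is Implementation.UNKNOWN

    def is_known(self) -> bool:
        return not self.is_unknown()


print(Implementation.PANDAS.is_known())     # True
print(Implementation.UNKNOWN.is_unknown())  # True
```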
Hellooo 👋🏼 Apologies for the late feedback on this - My days at work are a bit hectic.
I have been thinking a bit about how to extend Implementation but I don't have a concrete solution yet, so please bear with my half-baked idea, which might inspire further iterations (or the final decision to trash it).
The idea is to have one of the Implementation members act as an extension registry. This is totally allowed by an enum:
code - but actually it's better to jump to the "slight variation"
from __future__ import annotations

from collections import UserDict
from enum import Enum
from types import ModuleType

import pandas as pd
import polars as pl


class ExtensionRegistry(UserDict):
    active: None | str = None

    def activate(self, module_name: str) -> None:
        if module_name not in self.data:
            msg = f"Unknown extension '{module_name}'"
            raise ValueError(msg)
        self.active = module_name

    def to_native_namespace(self) -> ModuleType:
        if self.active is None:
            # A bare `raise` would fail here, as there is no active exception to re-raise
            msg = "No extension is active"
            raise ValueError(msg)
        return self.data[self.active]


class Implementation(Enum):
    PANDAS = "pandas"
    PYARROW = "pyarrow"
    UNKNOWN = "unknown"
    EXTENSION = ExtensionRegistry()

    @classmethod
    def register_extension(
        cls,
        module_name: str,
        module: ModuleType,
    ) -> None:
        # Value-based `in` checks on an Enum class require Python 3.12+
        if module_name in Implementation:
            msg = f"Cannot overwrite '{module_name}'"
            raise ValueError(msg)
        if module_name in Implementation.EXTENSION.value.data:
            msg = f"'{module_name}' already registered"
            raise ValueError(msg)
        cls.EXTENSION.value.data[module_name] = module

    @classmethod
    def from_string(cls, backend_name: str):
        try:
            return cls(backend_name)
        except ValueError:
            if backend_name in Implementation.EXTENSION.value:
                impl = Implementation.EXTENSION
                impl.value.activate(backend_name)
                return impl
            else:
                return Implementation.UNKNOWN

    def to_native_namespace(self) -> ModuleType: ...


def read_pq(filename: str, impl: Implementation) -> pd.DataFrame | pl.DataFrame:
    if impl is Implementation.EXTENSION:
        return impl.value.to_native_namespace().read_parquet(filename)
    else:
        # ok in this specific scenario, this line would fail since Implementation
        # does not have `to_native_namespace`, but we have it in Narwhals 😂
        ns = impl.to_native_namespace()
        return ns.read_parquet(filename)


filename = "foo.pq"
pl.DataFrame({"a": [1, 2, 3]}).write_parquet(filename)
Implementation.register_extension("polars", pl)
pd_impl = Implementation.from_string("pandas")
pl_impl = Implementation.from_string("polars")
print(pd_impl, pl_impl)
Implementation.PANDAS Implementation.EXTENSION
print(read_pq(filename=filename, impl=pl_impl))
shape: (3, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
└─────┘
Disclaimers
I am happy you made it this far 🎉 Now it's time for disclaimers:
- As we have quite a few edge cases I am not fully sure this would work in all scenarios
- I don't know how much overhead this adds
- I am sure some attributes could get sugar, e.g. Implementation.EXTENSION.value.data exposed as an extension_registry property or something.
- Of course we can register more information than the name and namespace/module
- I am not sure how it would work if multiple external packages try to register together - on this point, maybe we could load the entrypoints in the registry to be able to access the module from the implementation.
Slight variation
Slight variation with a dedicated class ExtensionImplementation (which is not an enum but has the same methods as Implementation). Honestly this looks nicer to me because the one above has:
- Inconsistent Interface: Implementation.EXTENSION behaves differently from other enum members.
- State Management: The registry has mutable state (active) that feels a bit awkward in an enum context.
# Reuses the imports and `filename` from the previous snippet
class ExtensionRegistry(UserDict):
    def register(self, module_name: str, module: ModuleType) -> None:
        if module_name in self.data:
            msg = f"Extension '{module_name}' already registered"
            raise ValueError(msg)
        self.data[module_name] = module

    def get_module(self, module_name: str) -> ModuleType:
        if module_name not in self.data:
            msg = f"Unknown extension '{module_name}'"
            raise ValueError(msg)
        return self.data[module_name]

    def is_registered(self, module_name: str) -> bool:
        return module_name in self.data


class Implementation(Enum):
    PANDAS = "pandas"
    PYARROW = "pyarrow"
    UNKNOWN = "unknown"
    _extensions = ExtensionRegistry()

    @classmethod
    def register_extension(cls, module_name: str, module: ModuleType) -> None:
        # Check against built-in implementations
        if module_name in Implementation:
            msg = f"Cannot overwrite built-in implementation '{module_name}'"
            raise ValueError(msg)
        cls._extensions.value.register(module_name, module)

    @classmethod
    def from_string(cls, backend_name: str):
        # Try built-in implementations first
        try:
            return cls(backend_name)
        except ValueError:
            # Check extensions
            if cls._extensions.value.is_registered(backend_name):
                return ExtensionImplementation(backend_name, cls._extensions)
            else:
                return Implementation.UNKNOWN

    def to_native_namespace(self) -> ModuleType:
        """Get the native module for built-in implementations"""
        if self == Implementation.PANDAS:
            import pandas as pd

            return pd
        elif self == Implementation.PYARROW:
            import pyarrow as pa

            return pa
        else:
            raise ValueError(f"No native namespace available for {self}")


class ExtensionImplementation:
    """Wrapper for extension implementations to provide consistent interface"""

    def __init__(self, name: str, registry: ExtensionRegistry):
        self.name = name
        self._registry = registry

    @property
    def value(self) -> str:
        """Return the name to match enum interface"""
        return self.name

    def to_native_namespace(self) -> ModuleType:
        """Get the registered module"""
        return self._registry.value.get_module(self.name)

    def __str__(self) -> str:
        return f"ExtensionImplementation.{self.name.upper()}"

    def __repr__(self) -> str:
        return f"ExtensionImplementation('{self.name}')"


def read_pq(
    filename: str, impl: Implementation | ExtensionImplementation
) -> pd.DataFrame | pl.DataFrame:
    ns = impl.to_native_namespace()
    return ns.read_parquet(filename)


# Register extensions
Implementation.register_extension("polars", pl)

# Get implementations
pd_impl = Implementation.from_string("pandas")
pl_impl = Implementation.from_string("polars")
unknown_impl = Implementation.from_string("nonexistent")
print(f"Pandas: {pd_impl}")
print(f"Polars: {pl_impl}")
print(f"Unknown: {unknown_impl}")
Pandas: Implementation.PANDAS
Polars: ExtensionImplementation.POLARS
Unknown: Implementation.UNKNOWN
print(read_pq(filename=filename, impl=pl_impl))
shape: (3, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
└─────┘
Sorry for the stream of consciousness debating against myself
Chiming in with some (minimalistic) thoughts here.
Enums are designed to not be mutable at runtime. While we can circumvent this (e.g. with a mutable data structure as an enum member), it creates some surprising code that we need to maintain, plug-in authors need to reason about, and our users need to reach for.
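To make the constraint concrete: an Enum that already defines members cannot be subclassed to add more, so a plugin could not simply extend the class. The toy enum below is a stand-in, not the real `nw.Implementation`.

```python
from enum import Enum


class Implementation(Enum):
    """Toy stand-in with two members."""

    PANDAS = "pandas"
    UNKNOWN = "unknown"


try:
    # Subclassing an Enum that already has members is rejected by the enum machinery
    class ExtendedImplementation(Implementation):
        NARWHALS_DAFT = "narwhals-daft"
except TypeError as exc:
    extension_error = exc
else:
    extension_error = None

print(type(extension_error).__name__)  # TypeError
```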
With that said, there is a need to provide the user something to "grab onto" when working with various Narwhals functions (e.g. it is sensible for a user to reach for Implementation.PANDAS when they want a function to be carried out via pandas code).
To keep things simple, what would happen if we didn't include Implementations for plug-ins?
Instead we can just have users rely on the import so that functions like scan_parquet can rely on a ModuleType check (e.g. if the user passed a module, directly attempt to call its scan_parquet function).
import narwhals as nw
import narwhals_daft
nw.scan_parquet(…, backend=narwhals_daft)
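A minimal sketch of what the ModuleType check could look like, assuming the plugin module exposes a top-level `scan_parquet` function. The `scan_parquet` here is a simplified stand-in for the Narwhals function, and `fake_plugin` is a module object created on the fly purely for the demo.

```python
from __future__ import annotations

from types import ModuleType
from typing import Any


def scan_parquet(source: str, *, backend: Any, **kwargs: Any) -> Any:
    """Simplified stand-in: only the module-dispatch branch is sketched."""
    if isinstance(backend, ModuleType):
        plugin_scan = getattr(backend, "scan_parquet", None)
        if plugin_scan is None:
            msg = f"{backend.__name__!r} does not provide `scan_parquet`"
            raise TypeError(msg)
        return plugin_scan(source, **kwargs)
    msg = "Built-in backends are out of scope for this sketch"
    raise NotImplementedError(msg)


# Hypothetical plugin module, built in memory so nothing needs to be installed.
fake_plugin = ModuleType("fake_plugin")
fake_plugin.scan_parquet = lambda source, **kwargs: f"fake_plugin scanned {source}"

print(scan_parquet("data.parquet", backend=fake_plugin))
```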
Pros:
- Removes the need for downstream plugin authors to interact with any registry mechanisms (which may change over time)
- Reduces the number of objects that users need to be aware of
Cons:
- Additional codepath for testing/maintenance
- Plugins follow a different codepath than internalized backends
Open Ended:
- What happens if a plugin needs to use Implementations internally (e.g. a plugin supports multiple backends in the same way that _pandas_like does)?