enh: `Implementation` for plugins

Open MarcoGorelli opened this issue 3 months ago • 10 comments

It might be worth thinking about how plugins / extensions can interact with Implementation

Implementation is used in some places such as:

  • scan_parquet
  • from_dict
  • ...

If someone has, say, narwhals-daft installed, I think they should be able to do

df: nw.LazyFrame # `df` is a narwhals LazyFrame backed by Daft

nw.scan_parquet(file, backend=df.implementation)

and

nw.scan_parquet(file, backend='narwhals-daft')

The crucial parts are:

https://github.com/narwhals-dev/narwhals/blob/8bfeb016bbaaddab4e0d4c98e0a006f5d6b465ed/narwhals/functions.py#L858-L859

https://github.com/narwhals-dev/narwhals/blob/8bfeb016bbaaddab4e0d4c98e0a006f5d6b465ed/narwhals/functions.py#L891-L898

We may need to transition Implementation away from being an Enum to something which can be extended, with a mechanism through which plugins can add their own member. For example, when we discover plugins, we do

In [1]: from importlib.metadata import entry_points

In [2]: entry_points(group='narwhals.plugins')
Out[2]: (EntryPoint(name='narwhals-daft', value='narwhals_daft', group='narwhals.plugins'),)

The name of the entrypoint (here, narwhals-daft) can serve as the name for the Implementation (Implementation.NARWHALS_DAFT), and we impose that plugin names must be unique (i.e. you can't install two plugins with the same name for the 'narwhals.plugins' entrypoint)

@dangotbanned @FBruzzesi as you've both been looking at Implementation recently

MarcoGorelli avatar Aug 27 '25 09:08 MarcoGorelli

I don't think we need this.

Implementation makes sense as an Enum because it allows branching on specific backends when working downstream.

The idea I had in (#2786) would mean passing in any object that can provide __narwhals_namespace__ - but with whatever constraints the call has.

E.g. if we want a scan_parquet, then calling this method is much simpler and works within the type system

namespace: DaftNamespace = obj.__narwhals_namespace__(...)

namespace.scan_parquet(...)

We can define all of that without explicitly naming any plugins or changing Implementation

dangotbanned avatar Aug 27 '25 10:08 dangotbanned

sure, and what do you propose doing to:

 implementation = Implementation.from_backend(backend) 
 native_namespace = implementation.to_native_namespace() 

when backend is say 'narwhals-daft'? And how would

df: nw.LazyFrame # `df` is a narwhals LazyFrame backed by Daft

nw.scan_parquet(file, backend=df.implementation)

work?

MarcoGorelli avatar Aug 27 '25 10:08 MarcoGorelli

And how would

df: nw.LazyFrame # df is a narwhals LazyFrame backed by Daft

nw.scan_parquet(file, backend=df.implementation)

work?

I'm just answering this one for now as it is simpler 😄

BaseFrame already satisfies the protocol

https://github.com/narwhals-dev/narwhals/blob/82052e24cb27aee51cd6cc357ed2d1f680b1807c/narwhals/dataframe.py#L107-L115

So we could just write:

df: nw.LazyFrame # `df` is a narwhals LazyFrame backed by Daft

nw.scan_parquet(file, backend=df)

Which then internally does:

def scan_parquet(
    source: str, *, backend: IntoBackend[Backend], **kwargs: Any  # `IntoBackend` can be extended
) -> LazyFrame[Any]:
    implementation = Implementation.from_backend(backend)
    if implementation is not Implementation.UNKNOWN:
        ns = Version.MAIN.namespace.from_backend(implementation).compliant
    elif supports_narwhals_namespace(backend):
        ns = backend.__narwhals_namespace__()
    else:
        raise TypeError
    return ns.scan_parquet(source, **kwargs)
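
To make the protocol branch above concrete, here is a self-contained sketch that does not import narwhals. `SupportsNarwhalsNamespace`, `FakeNamespace` and `FakeFrame` are stand-ins invented for this example, and `supports_narwhals_namespace` is assumed to be a simple structural check; the real helper may differ.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsNarwhalsNamespace(Protocol):
    """Anything that can hand back a compliant namespace (hypothetical protocol)."""

    def __narwhals_namespace__(self) -> Any: ...


def supports_narwhals_namespace(obj: Any) -> bool:
    # Structural check: only the presence of the method is tested.
    return isinstance(obj, SupportsNarwhalsNamespace)


class FakeNamespace:
    """Stand-in for a plugin's compliant namespace."""

    def scan_parquet(self, source: str, **kwargs: Any) -> str:
        return f"scanned {source}"


class FakeFrame:
    """Stand-in for a frame backed by a plugin."""

    def __narwhals_namespace__(self) -> FakeNamespace:
        return FakeNamespace()


def scan_parquet(source: str, *, backend: Any, **kwargs: Any) -> Any:
    # Only the extension branch is sketched; built-in backends are omitted.
    if not supports_narwhals_namespace(backend):
        msg = f"Unsupported backend: {backend!r}"
        raise TypeError(msg)
    return backend.__narwhals_namespace__().scan_parquet(source, **kwargs)
```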

dangotbanned avatar Aug 27 '25 10:08 dangotbanned

Sure, we can write something else, but

nw.scan_parquet(file, backend=df.implementation)

would not be supported, right?

MarcoGorelli avatar Aug 27 '25 10:08 MarcoGorelli

Sure, we can write something else, but

nw.scan_parquet(file, backend=df.implementation)

would not be supported, right?

I'm okay with this.

If that has to be supported even though there is an alternative, then two other ideas are:

  1. Resolve the Implementation.UNKNOWN inside scan_parquet by interfacing with EntryPoints
     i. I'm guessing the equivalent of this would need to happen anyway, even if Implementation.NARWHALS_DAFT worked
  2. Following (#3016), define behavior inside the descriptor to return something instead of Implementation.UNKNOWN on df.implementation when an extension is installed

dangotbanned avatar Aug 27 '25 11:08 dangotbanned

sure, makes sense, thanks

i'm also ok with not supporting that if alternatives are available (we can always leave a note about this in the docstring), at least for now

scan_parquet(file, backend='narwhals-daft') which then gets resolved directly to a compliant namespace (bypassing Implementation completely) seems fine too

MarcoGorelli avatar Aug 27 '25 11:08 MarcoGorelli

scan_parquet(file, backend='narwhals-daft') which then gets resolved directly to a compliant namespace (bypassing Implementation completely) seems fine too

Yep, couldn't have said it better myself 😄

I'm just answering this one for now as it is simpler 😄

dangotbanned avatar Aug 27 '25 11:08 dangotbanned

Also I think Implementation.UNKNOWN might be a nice invariant to have for anyone who explicitly doesn't want to support extensions.

Not sure how common that may be - but it's fairly simple with the current Enum:

import narwhals as nw

def what(backend: nw.Implementation):
    if backend is not nw.Implementation.UNKNOWN:
        ...

    if backend is nw.Implementation.UNKNOWN:
        ...

Or with new sugar:

def what(backend: nw.Implementation):
    if backend.is_known():
        ...

    if backend.is_unknown():
        ...
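
For reference, a minimal sketch of how such sugar could be defined. This `Implementation` is a cut-down stand-in, not the narwhals class, and `is_known` / `is_unknown` are the hypothetical methods suggested above.

```python
from enum import Enum


class Implementation(Enum):
    """Stand-in for nw.Implementation with only a few members."""

    PANDAS = "pandas"
    POLARS = "polars"
    UNKNOWN = "unknown"

    def is_unknown(self) -> bool:
        return self is Implementation.UNKNOWN

    def is_known(self) -> bool:
        return not self.is_unknown()


def describe(backend: Implementation) -> str:
    # Downstream code can opt out of extensions with a single check.
    if backend.is_known():
        return f"built-in backend: {backend.value}"
    return "extension or unsupported backend"
```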

dangotbanned avatar Aug 27 '25 12:08 dangotbanned

Hellooo 👋🏼 Apologies for the late feedback on this - My days at work are a bit hectic.

I have been thinking a bit about how to extend Implementation, but I don't have a concrete solution yet. Please bear with my half-baked idea, which might inspire further iterations (or the final decision to trash it).

The idea is to have one of the Implementation members act as an extension registry. This is totally allowed by an enum:

code - but actually it's better to jump to the "slight variation"
from __future__ import annotations

from collections import UserDict
from enum import Enum
from types import ModuleType

import pandas as pd
import polars as pl

class ExtensionRegistry(UserDict):
    active: None | str = None

    def activate(self, module_name: str) -> None:
        if module_name not in self.data:
            msg = f"Unknown extension '{module_name}'"
            raise ValueError(msg)

        self.active = module_name

    def to_native_namespace(self) -> ModuleType:
        if self.active is None:
            msg = "No extension is active"
            raise ValueError(msg)
        return self.data[self.active]

class Implementation(Enum):
    PANDAS = "pandas"
    PYARROW = "pyarrow"
    UNKNOWN = "unknown"
    EXTENSION = ExtensionRegistry()

    @classmethod
    def register_extension(
        cls,
        module_name: str,
        module: ModuleType,
    ) -> None:
        if module_name in Implementation:
            msg = f"Cannot overwrite '{module_name}'"
            raise ValueError(msg)

        if module_name in Implementation.EXTENSION.value.data:
            msg = f"'{module_name}' already registered"
            raise ValueError(msg)

        cls.EXTENSION.value.data[module_name] = module

    @classmethod
    def from_string(cls, backend_name: str):
        try:
            return cls(backend_name)
        except ValueError:
            if backend_name in Implementation.EXTENSION.value:
                impl = Implementation.EXTENSION
                impl.value.activate(backend_name)
                return impl
            else:                
                return Implementation.UNKNOWN

    def to_native_namespace(self) -> ModuleType: ...

def read_pq(filename: str, impl: Implementation) -> str:
    if impl is Implementation.EXTENSION:
        return impl.value.to_native_namespace().read_parquet(filename)
    else:
        # ok in this specific scenario, this line would fail since Implementation
        # does not have `to_native_namespace`, but we have it in Narwhals 😂
        ns = impl.to_native_namespace()
        return ns.read_parquet(filename)
        

filename = "foo.pq"
pl.DataFrame({"a": [1,2,3]}).write_parquet(filename)

Implementation.register_extension("polars", pl)

pd_impl = Implementation.from_string("pandas")
pl_impl = Implementation.from_string("polars")

print(pd_impl, pl_impl)
Implementation.PANDAS Implementation.EXTENSION

print(read_pq(filename=filename, impl=pl_impl))
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Disclaimers

I am happy you made it this far 🎉 Now it's time for disclaimers:

  1. As we have quite a few edge cases I am not fully sure this would work in all scenarios
  2. I don't know how much overhead this adds
  3. I am sure we can add sugar for some attributes, e.g. expose Implementation.EXTENSION.value.data as an extension_registry property or something.
  4. Of course we can register more information than the name and namespace/module
  5. I am not sure how it would work if multiple external packages try to register together - on this point, maybe we could load the entrypoints in the registry to be able to access the module from the implementation.

Slight variation

Slight variation with a dedicated class ExtensionImplementation (which is not an enum but has the same methods as Implementation). Honestly this looks nicer to me because the one above has:

  1. Inconsistent Interface: Implementation.EXTENSION behaves differently from other enum members.
  2. State Management: The registry has mutable state (active) that feels a bit awkward in an enum context.
class ExtensionRegistry(UserDict):
    def register(self, module_name: str, module: ModuleType) -> None:
        if module_name in self.data:
            msg = f"Extension '{module_name}' already registered"
            raise ValueError(msg)
        self.data[module_name] = module

    def get_module(self, module_name: str) -> ModuleType:
        if module_name not in self.data:
            msg = f"Unknown extension '{module_name}'"
            raise ValueError(msg)
        return self.data[module_name]

    def is_registered(self, module_name: str) -> bool:
        return module_name in self.data


class Implementation(Enum):
    PANDAS = "pandas"
    PYARROW = "pyarrow"
    UNKNOWN = "unknown"

    _extensions = ExtensionRegistry()

    @classmethod
    def register_extension(cls, module_name: str, module: ModuleType) -> None:
        # Check against built-in implementations
        if module_name in Implementation:
            msg = f"Cannot overwrite built-in implementation '{module_name}'"
            raise ValueError(msg)

        cls._extensions.value.register(module_name, module)

    @classmethod
    def from_string(cls, backend_name: str):
        # Try built-in implementations first
        try:
            return cls(backend_name)
        except ValueError:
            # Check extensions
            if cls._extensions.value.is_registered(backend_name):
                return ExtensionImplementation(backend_name, cls._extensions)
            else:                
                return Implementation.UNKNOWN

    def to_native_namespace(self) -> ModuleType:
        """Get the native module for built-in implementations"""
        if self == Implementation.PANDAS:
            import pandas as pd
            return pd
        elif self == Implementation.PYARROW:
            import pyarrow as pa
            return pa
        else:
            raise ValueError(f"No native namespace available for {self}")

class ExtensionImplementation:
    """Wrapper for extension implementations to provide consistent interface"""
    def __init__(self, name: str, registry: ExtensionRegistry):
        self.name = name
        self._registry = registry

    @property
    def value(self) -> str:
        """Return the name to match enum interface"""
        return self.name

    def to_native_namespace(self) -> ModuleType:
        """Get the registered module"""
        return self._registry.value.get_module(self.name)

    def __str__(self) -> str:
        return f"ExtensionImplementation.{self.name.upper()}"

    def __repr__(self) -> str:
        return f"ExtensionImplementation('{self.name}')"


def read_pq(filename: str, impl: Implementation | ExtensionImplementation) -> str:
    ns = impl.to_native_namespace()
    return ns.read_parquet(filename)


# Register extensions
Implementation.register_extension("polars", pl)

# Get implementations
pd_impl = Implementation.from_string("pandas")
pl_impl = Implementation.from_string("polars")
unknown_impl = Implementation.from_string("nonexistent")

print(f"Pandas: {pd_impl}")
print(f"Polars: {pl_impl}")
print(f"Unknown: {unknown_impl}")
Pandas: Implementation.PANDAS
Polars: ExtensionImplementation.POLARS
Unknown: Implementation.UNKNOWN

print(read_pq(filename=filename, impl=pl_impl))
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Sorry for the stream of consciousness debating against myself

FBruzzesi avatar Aug 27 '25 20:08 FBruzzesi

Chiming in with some (minimalistic) thoughts here.

Enums are designed to not be mutable at runtime. While we can circumvent this (e.g. with a mutable data structure as an enum member), doing so creates surprising code that we need to maintain, plug-in authors need to reason about, and our users need to reach for.
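
A small standard-library demonstration of both halves of that point, using a cut-down stand-in for Implementation (not the narwhals class). The helper functions are only there to make the behaviour observable.

```python
from enum import Enum


class Implementation(Enum):
    PANDAS = "pandas"
    UNKNOWN = "unknown"
    EXTENSION = {}  # mutable value acting as a registry (the workaround)


def try_reassign_member() -> str:
    """Existing members cannot be rebound at runtime."""
    try:
        Implementation.PANDAS = "something else"
    except AttributeError:
        return "AttributeError"
    return "no error"


def try_extend_by_subclassing() -> str:
    """An enum that already has members cannot be subclassed to add more."""
    try:

        class Extended(Implementation):
            DAFT = "daft"

    except TypeError:
        return "TypeError"
    return "no error"


def register(name: str, module: object) -> None:
    """The workaround: mutate the value held by one member."""
    Implementation.EXTENSION.value[name] = module
```

After `register("daft", object())`, `len(Implementation)` is still 3: the set of members is unchanged, yet the behaviour of `Implementation.EXTENSION` now depends on hidden state.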

With that said, there is a need to provide the user something to "grab onto" when working with various Narwhals functions (e.g. it is sensible for a user to reach for Implementation.PANDAS when they want a function to be carried out via pandas code).


To keep things simple, what would happen if we didn't include Implementations for plug-ins? Instead we can just have users rely on the import so that functions like scan_parquet can rely on a ModuleType check (e.g. if the user passed a module, directly attempt to call its scan_parquet function).

import narwhals as nw
import narwhals_daft

nw.scan_parquet(…, backend=narwhals_daft)
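
A minimal sketch of what that ModuleType check could look like, written without narwhals so it runs on its own. The `scan_parquet` here is a hypothetical stand-in for the narwhals function, and `fake_plugin` is a module object built on the fly in place of narwhals_daft.

```python
from __future__ import annotations

from types import ModuleType
from typing import Any


def scan_parquet(source: str, *, backend: Any, **kwargs: Any) -> Any:
    # If the user passed a module, directly attempt to call its scan_parquet.
    if isinstance(backend, ModuleType):
        scan = getattr(backend, "scan_parquet", None)
        if scan is None:
            msg = f"Module {backend.__name__!r} does not provide scan_parquet"
            raise TypeError(msg)
        return scan(source, **kwargs)
    # Built-in backends (Implementation members, strings) are omitted here.
    msg = "only the module-based path is sketched"
    raise NotImplementedError(msg)


# Stand-in for `import narwhals_daft`.
fake_plugin = ModuleType("fake_plugin")
fake_plugin.scan_parquet = lambda source, **kwargs: f"fake scan of {source}"
```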

Pros:

  • Removes the need for downstream plugin authors to interact with any registry mechanisms (which may change over time)
  • Reduces the number of objects that users need to be aware of

Cons:

  • Additional codepath for testing/maintenance
  • Plugins follow a different codepath than internalized backends

Open Ended:

  • What happens if a Plugin needs to use Implementations internally (e.g. a plugin supports multiple backends in the same way that _pandas_like does)?

camriddell avatar Sep 02 '25 15:09 camriddell