ENH: Support Plugin Accessors Via Entry Points
TLDR: Allows external libraries to register accessors for pandas objects (DataFrame, Series, Index) using the 'pandas.<pd_objs>.accessor' entry point group. This enables plugins to be automatically used without explicit import.
I'm working on this PR collaboratively with @afonso-antunes .
- [X] closes #29076
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [x] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Proposal
We propose implementing an entrypoint system similar to Vaex (#29076) to allow easy access to the functionalities of any installed plugin without requiring explicit imports. The idea is to make all installed packages available for use, only being "imported" when they are needed in the program, in a seamless manner.
Current Behavior
Currently, each plugin must be explicitly imported:
import pandas as pd
import vaex.graphql # required to enable .graphql (.graphql is compatible with pd.DataFrames)
df = pd.DataFrame(...)
df.graphql.query(...) # only works after the import
Proposed Behavior
With our feature implemented, the code would be simplified to:
import pandas as pd
df = pd.DataFrame(...)
df.graphql.query(...) # works directly if the plugin is installed via pip
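The "only imported when needed" idea from the proposal could be sketched with a descriptor that defers loading until the accessor is first used. This is an illustrative sketch, not the implementation in this PR: `LazyAccessor` is a hypothetical name, and `loader` stands in for something like an entry point's `load` method.

```python
from typing import Callable, Optional


class LazyAccessor:
    """Descriptor that loads the accessor class on first attribute access.

    Hypothetical helper for illustration; not part of pandas.
    """

    def __init__(self, loader: Callable[[], type]) -> None:
        self._loader = loader  # e.g. an EntryPoint's bound .load method
        self._cls: Optional[type] = None

    def __get__(self, obj, owner=None):
        if self._cls is None:
            # The plugin import happens here, once, on first use.
            self._cls = self._loader()
        if obj is None:
            # Accessed on the class itself: return the accessor class.
            return self._cls
        # Accessed on an instance: wrap that instance in the accessor.
        return self._cls(obj)
```

Attaching `LazyAccessor(loader)` as a class attribute means the plugin module is not imported until some object actually touches that attribute.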
Most of the errors are from:
from importlib_metadata import entry_points
I'm personally happy to add this, but it's kind of a big change in terms of the code users can write. @pandas-dev/pandas-core, thoughts here?
If this moves forward, you'll want to fix the CI and add documentation for this.
Hard to understand without documentation that illustrates a use case.
We allow creating pandas accessors with register_dataframe_accessor. For cyberpandas for example, when you do import cyberpandas they'd call that, and you'll be able to use the accessor:
df.ip.is_ipv6
If we allow registration via Python entry points, the import cyberpandas won't be needed anymore. On import pandas we will check the packages the user has installed in their environment that provide an accessor, and we will register them automatically.
Same idea as what PDEP-9 proposed for the read_* and to_* functions/methods if you remember that.
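A minimal sketch of the discovery step described above, not the code in this PR. It assumes Python 3.10+ (for the `group=` keyword of `importlib.metadata.entry_points`) and uses the group name `pandas.DataFrame.accessor` from the PR description. `register_accessors_from_entry_points` and its `register` parameter are hypothetical names chosen for illustration.

```python
from importlib.metadata import EntryPoint, entry_points
from typing import Callable, Iterable, Optional

# Group name taken from the PR description ('pandas.<pd_objs>.accessor').
GROUP = "pandas.DataFrame.accessor"


def register_accessors_from_entry_points(
    register: Callable[[str], Callable[[type], type]],
    eps: Optional[Iterable[EntryPoint]] = None,
) -> list[str]:
    """Load every entry point in GROUP and register it under its name.

    `register` is expected to behave like
    pandas.api.extensions.register_dataframe_accessor: called with a name,
    it returns a decorator that takes the accessor class.
    """
    if eps is None:
        # Assumes Python 3.10+; only entry points declared in GROUP are returned.
        eps = entry_points(group=GROUP)
    registered = []
    for ep in eps:
        accessor_cls = ep.load()  # the plugin module is imported here
        register(ep.name)(accessor_cls)
        registered.append(ep.name)
    return registered
```

With pandas installed, the call could look like `register_accessors_from_entry_points(pd.api.extensions.register_dataframe_accessor)`. A plugin such as cyberpandas would then declare an entry named `ip` in that group in its packaging metadata instead of registering at import time.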
On import pandas we will check the packages the user has installed in their environment that provide an accessor, and we will register them automatically.
Couldn't that be really expensive if lots of packages were installed in the environment? What if there were conflicts in naming?
Agreed this would need docs, but I'm generally +1 on using entry points rather than import-time side effects.
Couldn't that be really expensive if lots of packages were installed in the environment?
It's scoped to projects that declare an entrypoint, so it doesn't scale with every package installed. But it would be good to measure the performance impact on import pandas here, both for an environment with several and without any entrypoints installed.
What if there were conflicts in naming?
That's probably defined somewhere in importlib, but the same issue would affect plugins using import-time registration today.
Since the implementation seems stable now, would it make sense to start working on the documentation already, or would you prefer we wait for further maintainer input?
What if there were conflicts in naming?
I think by default, the last package found for the entry point with that name will overwrite the previous. As Tom says, this is the same as with imports now: the second import will overwrite the accessor of the first. But we have control over it when registering the entry points. We could keep the behaviour but show a warning, raise an exception and ask the user to remove one of the packages (probably not a great option), or let the user decide which package has higher priority in the config. Since this should be very rare, I would go for the simplest solution that doesn't "fail" silently, which would be to show a warning saying something like: Both packageA and packageB provide the accessor foo. packageA is being used, please uninstall the package you don't want to use to remove this warning.
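The warning option could be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: `resolve_accessor_conflicts` and `_provider` are hypothetical helpers, the "last one wins" rule mirrors the default behaviour described above, and the message wording is adapted from the suggestion in this comment.

```python
import warnings
from collections import defaultdict
from importlib.metadata import EntryPoint
from typing import Iterable


def _provider(ep: EntryPoint) -> str:
    """Best-effort package name for an entry point (illustrative helper)."""
    dist = getattr(ep, "dist", None)
    if dist is not None:
        return dist.metadata["Name"]
    # Fallback when no distribution info is attached: top-level module name.
    return ep.value.split(":")[0].split(".")[0]


def resolve_accessor_conflicts(eps: Iterable[EntryPoint]) -> dict[str, EntryPoint]:
    """Keep the last entry point per name and warn about every collision."""
    by_name = defaultdict(list)
    for ep in eps:
        by_name[ep.name].append(ep)

    chosen = {}
    for name, candidates in by_name.items():
        chosen[name] = candidates[-1]  # assumption: the last one found wins
        if len(candidates) > 1:
            providers = [_provider(ep) for ep in candidates]
            warnings.warn(
                f"Multiple packages provide the accessor '{name}': "
                f"{', '.join(providers)}. Using {providers[-1]}; uninstall the "
                "package you don't want to use to remove this warning.",
                UserWarning,
                stacklevel=2,
            )
    return chosen
```

Because the result is a plain mapping from accessor name to the chosen entry point, a later config option for priorities could replace the "last one wins" line without touching the callers.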
would it make sense to start working on the documentation already, or would you prefer we wait for further maintainer input?
Up to you @PedroM4rques. The more complete this PR is, the easier it is for everybody to understand what is being proposed. But if in the end there is no agreement to add this, you'll have spent time on a PR that won't get merged.
I think by default, the last package found for the entry point with that name will overwrite the previous
From what I could test locally, this is true.
We could keep the behaviour but show a warning [or] raise an exception and ask the user to remove one of the packages
I think raising an exception is the better approach, as this is a critical error. It’s likely a rare scenario, and if it does occur intentionally, the user can always handle it explicitly (try-catch and pass for example). I think the plugin system would be safer this way.
I can also imagine that it's possible to implement a system where the user chooses which plugins to discard in the catch block. Would that be desirable?
I don't think we should raise an exception. Imagine a case where someone working with DNA has two packages installed that provide a dna accessor. The user doesn't even care about the accessors; they're using the packages independently of pandas. Raising means that the user needs to uninstall one of the packages they need in order to use pandas. It doesn't make any sense in my opinion. Ideally we would just inform the user which accessor pandas will use, in case they care, and how to change it if needed. That should probably be done with an option, but since at present this is an extremely rare scenario I wouldn't make things complex to implement it.
Raising means that the user needs to uninstall one of the packages they need in order to use pandas
I agree, that wouldn't make any sense.
Unless there are any objections, we'll implement the warning system.
Are there other mainstream packages using entrypoints?
Yeah, xarray, fsspec, pytest are a few.
Entry points can be a good option anytime you have some sort of plugin system that requires coordinating how a "framework" (pandas in this case) loads code provided by a plugin.
We already use entrypoints in pandas for the plotting backends. Besides what Tom said, if I'm not wrong many projects using commands (e.g. jupyter <command>, black <command>, flake8 <command>) are implemented with entrypoints. Airflow plugins are also using them. I don't think they are super popular, but surely not experimental or rare.
Thanks @TomAugspurger. Looking at prior art, all options to deal with collisions exist.
My personal preference would be to warn.
Above, I raised 3 issues:
- Documentation is needed
- Concern about performance when people don't have packages using entry points, and import pandas as pd just takes longer because there are a lot of packages installed. This probably has more to do with the performance of importlib.metadata.entry_points() than anything else.
- Concern about duplicates and conflicts.
I don't think (1) or (2) have been discussed.
For (3), the suggestion of warning if a conflict occurs is fine with me.
Tom commented about 2. The entry points are a registry. I think the cost is just a lookup of the entry point name in a hash table. It shouldn't depend on the amount of packages installed. So it's just the loop over the packages that register an accessor that exist in the user environment. Even if this becomes popular, I'd be surprised the number is more than around 5. I don't think there should be any impact in practice.
But worth benchmarking, better to be sure.
Hi @noatamir !
We're hitting a small build issue. There's a broken reference in the docs that's causing an error in the pipeline. I can't really understand what's happening tbh.
As the code is basically done, and I need your review anyway, could you please take a look at it?
Ty!
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.