dedupe setuptools plugin solution for variables

Using setuptools plugin facilities to set up variables. will close #1085

external plugins

external dedupe-variable plugins will look like https://github.com/dedupeio/dedupe-variable-datetime

in particular, they will have this in their pyproject.toml (or equivalent setup.cfg/setup.py)

[project.entry-points]
dedupevariables = {datetimetype = "datetimetype:DateTimeType"}

https://github.com/dedupeio/dedupe-variable-datetime/blob/fcd36afd000641168d9ae369623df866eeac35f9/pyproject.toml#L16

to do

[x] https://github.com/dedupeio/dedupe-variable-datetime
[ ] https://github.com/dedupeio/dedupe-variable-name
[ ] https://github.com/dedupeio/dedupe-variable-address
[ ] https://github.com/dedupeio/dedupe-variable-fuzzycategory

Dec 10 '22 20:12 fgregg

Coverage increased (+0.08%) to 73.364% when pulling 784bf6361c472af9bae18260fed4a9aa9d6995f6 on pluggy-variables into 3666d3a3a29b76ef88da707042dc2d582262783a on main.

Dec 10 '22 21:12 coveralls

@NickCrews so here's an implementation inspired by how datasette uses pluggy. would appreciate your thoughts on this.

Dec 11 '22 01:12 fgregg

Unfortunately I don't think that pluggy was actually a good suggestion from me, sorry :(

I think the largest drawback is the boilerplate that you describe above that plugin authors will need.

Second, I don't like how this isn't easily testable. At least I can't see how to install a mock package that registers an entry point. I'm assuming you didn't either, and that's why there are no tests here. When I see commits where someone bumps some external package to check for a change in behavior I get nervous lol

The actual implementation here looks good though if this still is the direction you want to head.

Dec 11 '22 08:12 NickCrews

there are reasonably test pattern,

but i feel you on the boilerplate. it’s about the same amount and much clearer boilerplate than the namespace solution.

hmm…

Dec 11 '22 11:12 fgregg

the core of way that pluggy works with external libraries is with the setuptools mechanism.

    def load_setuptools_entrypoints(
        self, group: str, name: Optional[str] = None
    ) -> int:
        """Load modules from querying the specified setuptools ``group``.

        :param str group: Entry point group to load plugins.
        :param str name: If given, loads only plugins with the given ``name``.
        :rtype: int
        :return: The number of plugins loaded by this call.
        """
        count = 0
        for dist in list(importlib_metadata.distributions()):
            for ep in dist.entry_points:
                if (
                    ep.group != group
                    or (name is not None and ep.name != name)
                    # already registered
                    or self.get_plugin(ep.name)
                    or self.is_blocked(ep.name)
                ):
                    continue
                plugin = ep.load()
                self.register(plugin, name=ep.name)
                self._plugin_distinfo.append((plugin, DistFacade(dist)))
                count += 1
        return count

https://github.com/pytest-dev/pluggy/blob/450b85178d0ea2db4ecf75f6abfbb97e1164d5ab/src/pluggy/_manager.py#L352-L377

Reading setuptool's docs for this, i think we can do something much more direct than this current pluggy approach.

it will look similar to @NickCrews's proposal, but the developer will only have to write the pathstring once (for the entrypoint), instead of the user having to do it multiple times.

Dec 11 '22 11:12 fgregg

okay, i'm liking this solution a lot better than the pluggy one. thanks for pushing against that @NickCrews!

i think it should be a bit more user friendly than the full path model, but almost as convenient for developers hacking one something.

i'm going to try this out on a local project that needs a custom variable type and if i still like it, i'll write some tests and bring it in.

Dec 11 '22 12:12 fgregg

This is much better than pluggy, awesome!

I think the basic philosophical difference between this and #1122 is who has the responsibility of defining the lookup key, and whether namespacing with the package name is important. Here, when the author of plugin A writes

[project.entry-points]
dedupevariables = {zipcodetype = "pluginA:ZipCodeVariable"}

they decide that the way to get this plugin is via the key "zipcodetype". You don't have to know that it lives inside a package called pluginA. If there is another plugin B that wants to do basically the same thing, they can't say dedupevariables = {zipcodetype = "pluginB:ZipCodeVariable"} because the key zipcodetype is now ambiguous, do you mean from pluginA or pluginB?

Whereas in #1122, the name of the package, pluginA vs pluginB, is significant, and is used to namespace. Now you can have two different plugins that export variables of the same name. On PyPI pluginA and pluginB are already forced to be using different names, so we can piggyback off that guarantee and now you can't have accidental collisions.

What do you think, are you trying to avoid having to specify "pluginA" vs "pluginB" in your variable definition dicts? I think that is silly, you already know where ZipCodeVariable is coming from, you have the package name written down in your requirements.txt. Am I missing something here? Note this is different from wanting to not care about WHAT FILE in the plugin , which I agree with the user shouldn't have to specify. But plugin authors shoud get around that with what I mention in https://github.com/dedupeio/dedupe/pull/1122#issuecomment-1345801485

Dec 12 '22 03:12 NickCrews

i think my main reason for preferring this is consistency of interface. i don't like that there are some variables that we refer to with a short alias and others that we use a path in the datamodel.

we could have a deprecation path where the paths were the way to reference variables, but honestly i think that a lot of new people would be confused by that.

if we were going to go down that road, i might even prefer a datamodel that went from:

data_model = [{field: "name", type: "String"},
              {field: "address", type: "String"}]

to

from dedupe.variables import StringVariable
data_model = [StringVariable(field="name"), StringVariable(field="address")]

which would avoid the plugin question altogether.

i kind of like that solution, but it seems like a lot of disruption without a huge benefit to users.

hmmm....

Dec 12 '22 04:12 fgregg

closed by #1193

Jun 27 '24 05:06 fgregg