splink [MAINT] `linker.py` cleanin/refactoring

Is your proposal related to a problem?

Tags onto this conversation about cleaning up our settings class - https://github.com/moj-analytical-services/splink/issues/1669.

I think we should sit down and work out how we can start to clean up linker.py, which has grown to nearly 4000 lines at this point.

Admittedly, a lot of this comes from docstrings we've put in, however, it feels as if a lot of the code could be migrated away to a new home and then inherited.

Some quick thoughts on what we might want to change/remove:

Migrate out the SQL execution and logging.
Trim out a lot of the settings and df validation steps into their own areas (settings startup feels like a good place).
Lots of properties could be moved to a new home. For example, anything to do with the input dataframe(s) could be housed in a new class, which should hopefully improve readability - https://github.com/moj-analytical-services/splink/blob/master/splink/linker.py#L291.

My view is that the linker class should only contain methods we wish to call and everything else should be housed in their own separate, but sensible areas.

This should make everything easier to find, test and add to going forward.

Oct 27 '23 11:10 ThomasHepworth

Am very much in favour of this. At a bare minimum I think splitting up the module into sensible grouped chunks would be handy, which could be achieved as you say via mixins:

# splink.linker.py
from splink.linker.charts import _LinkerChartsMixin
from splink.linker.charts import _LinkerTrainingMixin

class Linker(_LinkerChartsMixin, _LinkerTrainingMixin):
    ...

However I am interested also in somewhat breaking up the Linker object itself, by nesting methods under related classes which live as attributes on the linker (something me and @zslade discussed a little in regards to #1538). What I mean is something like:

# splink.linker.charts.py

class LinkerCharts:
    def __init__(self, linker: Linker):
        ...

    def unlinkables_chart(self, x_col='match_weight', source_dataset=None, as_dict=False):
         ...

and

# splink.linker.py
from splink.linker.charts import LinkerCharts

class Linker:
    def __init__(self, ...):
        self.charts = LinkerCharts(self)

so that we instead have from a user POV

linker.charts.unlinkables_chart()
# instead of the old way:
linker.unlinkables_chart()

The only things that live directly on the linker are the 'behind-the-scenes' stuff that these other bits of functionality require.

I realise this means some code gets a bit longer, but if you are like me and have difficulty remembering the names of some methods, it allows a way to more quickly narrow down your search when using autocomplete (typing linker.training. will leave you with a lot fewer options than the load you get with the current linker.). It also means:

it is potentially easier to add methods without worrying about further bloating Linker
- it could be easier to add some niceties like linker.charts.available_charts() or w/e

Of course this is quite a severe backwards-incompatibility #1644, and might require a bit of careful planning about which attribute-classes know about one another to keep functionality. Also have some thoughts re: the SQL execution part, but will post that separately

Oct 30 '23 15:10 ADBond

Tangentially related, but I think there might also be cases where we want to leverage some of the functionality of the Linker without necessarily needing a 'full one'. For example, in relation to working on #1001 I have found it useful to read in 'edges' and 'clusters' tables, and create a cluster_studio_dashboard. Currently I have to pass in some dummy data to create the linker, as well as a settings dict (which I am not necessarily interested in for my case). I feel like this sort of use-case may not be too unusual, so maybe there is a way to leverage this ☝️ proposed modularity to make things like that easier. Also thinking of things like profiling your data without needing a settings dictionary to do so, without keeping complicated logic / validation in the linker (in the spirit of #1055).

i.e. for the right choice of submodules maybe we can make something like a LinkerProfiling which only has the profiling functionality, without needing settings.

Might all be too complicated for fairly niche use-cases, but perhaps worth thinking about whether or not we can naturally structure things in such a way that we get this sort of thing more-or-less 'for free'

Oct 30 '23 16:10 ADBond

I've made a mini-working sketch of the database api object suggested in this comment, which would take over some of the functionality currently tackled directly by the linker. I've just focussed on what seems to me the 'main' SQL execution methods (_execute_sql_against_backend, _log_and_run_sql_execution, and _run_sql_execution), and there are a few ideas here - mostly based around the fact that I get a bit confused with the current implementation:

which of these I need to implement in a subclass, and quite what they should do
which I should be accessing as a user of the class (from e.g. other linker methods)

So there are a couple of ideas here - there is certainly a bit of complexity going on in the base class, but my aim was to try and 'soak up' a lot of complexity there, and make things simpler for subclassers / clients of subclasses. Specifically:

_log_and_run_sql_execution now happens automatically with some decorator/metaclass stuff, as this method is directly tied to _run_sql_execution, to hopefully make it clearer which methods should be called
_execute_sql_against_backend is not directly overridden, but broken up into setup/cleanup functions which are - so that subclassers don't have to directly touch any of the methods that actually run SQL, partly so there is no confusion between the 'bare' method and the 'wrapped' version above [not sure if this plays well with athena]
generally trying to bottleneck things so that all execution goes through a fixed path
there is an api-factory: db_api("duckdb")("dbfile.db") - the double brackets is not so nice, but is mainly done to allow help with tooltips for the varying arguments you will get depending on your flavour.

There may well be additional complications when it comes to a more full implementation, but hopefully at least some of the ideas here might prove useful.

Nov 07 '23 18:11 ADBond

Has been split apart in Splink 4, closing

Jul 24 '24 18:07 RobinL

splink splink copied to clipboard

[MAINT] `linker.py` cleanin/refactoring

Is your proposal related to a problem?

splink
splink copied to clipboard