
[FEAT] Cross-Backend / Lazily Evaluated Comparisons and Blocking Rule builders

Open ThomasHepworth opened this issue 1 year ago • 8 comments

Is your proposal related to a problem?

See this closed issue for some further background, though it's not essential reading: https://github.com/moj-analytical-services/splink/issues/1670.


The Problem

Currently, we face the following challenges:

  1. There is a significant amount of repetitive code involved in creating backend-specific functions. An example of this can be found in duckdb_comparison_imports.py, which is one of five similar scripts in use.
  2. Users are required to import comparisons and blocking rules directly from a specific SQL dialect's folder, such as import splink.duckdb.comparisons as cl.

Note (from Robin): this is primarily a tech debt problem that is affecting our ability to maintain the codebase, not a user experience problem. i.e. the improvements to user experience are probably minor and wouldn't justify a large amount of work. We have something like 4,000 lines of code on comparison_levels, comparisons_library and comparison_template_library which I think can be very substantially reduced in lines of code and complexity.

After discussions with Robin, we are considering two potential solutions:

  • Implementing backend-agnostic functions that are evaluated only when needed. These would be dynamic, creating the functions on the fly when given a SQL dialect.
  • Creating factories that construct the required functions for each backend and make all functions accessible to users.

Potential Solutions

Backend Agnostic Functions

Backend agnostic builders would allow us to change the current functionality from:

from splink.duckdb.comparisons import exact_match
exact_match("help")

to a simplified:

from splink.comparisons import exact_match
exact_match("help")

Initially, exact_match will be created without specifying a backend dialect. It will then be fully generated when passed through the linker's settings class and given a dialect.

To achieve this, we propose a combination of factories and methods for the dynamic generation of our classes when SQL dialect information becomes available.
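As a rough illustration of the lazy-evaluation idea (all names here are hypothetical, not Splink's actual API), a comparison could be constructed without a dialect and only rendered to SQL once the linker supplies one:

```python
# Hypothetical sketch of lazy dialect injection - names are illustrative,
# not Splink's actual API.

class LazyExactMatch:
    """Holds the column name; the SQL is only available once a dialect is set."""

    def __init__(self, col_name):
        self.col_name = col_name
        self.sql_dialect = None  # unknown until the linker supplies it

    def activate(self, sql_dialect):
        # called by the linker's settings class once the dialect is known
        self.sql_dialect = sql_dialect
        return self

    @property
    def sql(self):
        if self.sql_dialect is None:
            raise ValueError(
                "No SQL dialect set - pass this object to a linker first"
            )
        # exact-match SQL happens to be identical across dialects
        return f"{self.col_name}_l = {self.col_name}_r"


def exact_match(col_name):
    return LazyExactMatch(col_name)
```

In this sketch the linker, not the user, would call `activate`, so `from splink.comparisons import exact_match` works identically for every backend.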

Advantages:

  • Eliminates the need to write docstrings for each dialect.
  • Reduces excessive code duplication present in the comparison_imports scripts, simplifying the codebase.
  • Offers users simplified access to functions.
  • Enables rapid switching between linkers. Changing the linker variant is all that's required; the comparisons and blocking rules in your settings will function without further adjustments (a valuable feature for our DL pipelines).

Disadvantages:

  • It's not immediately apparent which classes/functions are available for each backend without consulting the documentation. We may consider adding a table in Splink that showcases the available options for each backend.
  • It introduces a minor break in backward compatibility, although the impact is minimal.

As part of this, I've drafted a potential solution here, showing how this could work for comparisons. There's also a solution for the exact_match_rule blocking rules class, which is easier to read but too simple to offer a full picture.

This solution works by delaying creation of the class until a sql_dialect argument is passed. This can be passed through:

  1. Inheritance - i.e. pass it down through our base classes
  2. Setting it manually - my_comparison_class.sql_dialect = "duckdb".

Moreover, this example illustrates how this approach would prevent backends without specific functionality from generating classes. This is achieved by exclusively associating certain functions, such as damerau_levenshtein, with base classes that are compatible with them. Attempting to generate a function lacking this property will result in an error.

It's not a perfect solution, but it offers a framework that requires minimal rewriting of the existing code. We can drag and drop a lot of the existing code, wrapping it in factory classes that allow us to dynamically create the required classes on the fly.

Non-agnostic solution w/ Factories

Alternatively, if the original from splink.duckdb.comparisons import exact_match ends up being the preferred way to access these helper functions, we can use factories.

Factories could be used to produce all of our required backend functions, without the need for excessive boilerplate and copy+pasting.

For example, here's what the code could look like on the blocking side - blocking_rule_imports.py.

Here, we simply need to specify the dialect in the factory - _blocking_rule_functions = _blocking_rule_library_factory("duckdb") and then all of the background code generates our classes for users.
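A minimal sketch of that factory pattern (function and variable names are my own invention for illustration, not the linked code) might look like:

```python
# Hypothetical sketch of the factory approach - names are illustrative,
# not Splink's actual code.

_RULE_TEMPLATES = {
    # per-dialect SQL overrides could live in a nested dict keyed by dialect
    "exact_match_rule": 'l."{col}" = r."{col}"',
}


def _blocking_rule_library_factory(sql_dialect):
    """Bind a dialect once and stamp out every helper function for it."""

    def make_rule(template):
        def rule(col):
            return {
                "blocking_rule": template.format(col=col),
                "sql_dialect": sql_dialect,
            }
        return rule

    return {name: make_rule(tmpl) for name, tmpl in _RULE_TEMPLATES.items()}


# In e.g. splink/duckdb/blocking_rule_imports.py, a single call generates
# all of the backend's helpers:
_blocking_rule_functions = _blocking_rule_library_factory("duckdb")
exact_match_rule = _blocking_rule_functions["exact_match_rule"]
```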

The pros and cons versus the current methodology are very similar to those above. So, a quick pros vs cons against the backend-agnostic approach...

Pros:

  • Imports can be inferred by IDE autocompletion (e.g. IntelliJ)
  • Imports feel more natural(?) - I'm not sure this is true.
  • The factory only needs to be run once to set up the functions. If they were embedded in the linker class, I think you'd need to scan each comparison, comparison level, blocking rule, etc. and then construct it.
  • We can print out the correct sql string (as discussed)

Cons:

  • Docstring hell… <- though, we could try to trim the docstrings so they’re only relevant for the given dialect.
  • Lots of different files… more admin for us. More areas of failure. It's also more difficult to add additional linkers.
  • Far less flexible. More paths to change if you wish to swap linkers.

ThomasHepworth avatar Oct 27 '23 11:10 ThomasHepworth

I've had a go at a solution to this here. It's similar to your 'backend agnostic functions' but possibly even simpler.

There is a demo script here showing how it may be used, and demonstrating the behaviour.

The above is deliberately an absolute minimalist demonstration of the concept to make it easy to understand what's going on. I've mocked the Linker, DuckDBLinker and settings to keep things as short and readable as possible.

Here are a couple of more realistic extensions that I did to check the solution works when things get a little more complicated:

  • More realistic/complicated levels
  • Add blocking

Elements of the solution:

Pros

This solution should completely eliminate the need for backend-specific comparison levels, comparisons, or comparison templates. Many of the files in the various backend folders, e.g. splink.duckdb, can simply be deleted. There's no need for any inheritance hierarchies, or for the complicated patterns of imports we've needed to make sure the right levels are exposed in each backend (from splink.duckdb import ...).

  1. I have deliberately chosen to write the comparison levels (e.g. levenshtein_level) as standard functions rather than classes. I think this makes them easiest to understand, so that external contributors unfamiliar with the larger codebase could contribute another one fairly easily.
  2. To separate concerns, I wanted these comparison level functions to not need to know anything about the implementation of the ComparisonLevel class and simply adhere to the jsonschema spec for a comparison level. I think this makes them conceptually simpler because all the function needs to do is create a dict.
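To make points (1) and (2) concrete, a function-based level might look something like the sketch below (names are my own assumptions; the real demo lives in the linked script):

```python
# Hypothetical sketch of a comparison level as a plain function returning
# a dict that adheres to the comparison-level spec - names are illustrative.

_LEVENSHTEIN_FN_NAMES = {
    "duckdb": "levenshtein",
    "spark": "levenshtein",
    # a dialect missing from this mapping simply isn't supported
}


def levenshtein_level(col_name, distance_threshold, sql_dialect):
    try:
        fn = _LEVENSHTEIN_FN_NAMES[sql_dialect]
    except KeyError:
        raise ValueError(
            f"levenshtein is not available for dialect {sql_dialect!r}"
        )
    return {
        "sql_condition": (
            f"{fn}({col_name}_l, {col_name}_r) <= {distance_threshold}"
        ),
        "label_for_charts": f"levenshtein <= {distance_threshold}",
    }
```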

Re (2) I think one of the drawbacks of your suggested BlockingRule class is that it mixes two things that can be kept separate conceptually: The BlockingRule class that wraps the jsonschema for the blocking rules, and the task of generating the sql representing a blocking rule.

To try and clarify, the separation here is:

  • A blocking rule is fundamentally just a SQL string (occasionally it has salting too). How that SQL string is generated is not necessarily the concern of the BlockingRule class. At least, I don't think it needs to be - I think a simple function suffices. I don't think the BlockingRule class needs, for instance, to know its dialect; that's implicit in the sql.
  • The BlockingRule class contains data (the sql string) and performs calculation (methods) on the SQL string such as figuring out what the equi-join conditions are, etc.
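A toy version of that separation (my own names; the real class has many more methods, and the equi-join check here is deliberately naive) could be:

```python
import re

# Hypothetical sketch: a plain function generates the SQL string, and
# BlockingRule merely wraps it and performs calculation on it. The
# equi-join check is a naive regex; real code would parse the SQL properly.


def exact_match_sql(col):
    return f"l.{col} = r.{col}"


class BlockingRule:
    def __init__(self, blocking_rule_sql, salting_partitions=1):
        self.blocking_rule_sql = blocking_rule_sql
        self.salting_partitions = salting_partitions

    @property
    def is_equi_join(self):
        # true only for a single "l.col = r.col" condition on the same column
        return bool(
            re.fullmatch(r"l\.(\w+) = r\.\1", self.blocking_rule_sql)
        )
```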

Cons

  • Having comparison levels as standard functions means a few lines of boilerplate are needed in each function.
  • It's not completely obvious how we check that a given comparison level is implemented against a backend, but I think something like this should suffice.

Possible room for improvement

There may be room for improvement in what happens when the user asks for a ComparisonLevel method on a LazyComparisonLevelFactory. At the moment it produces a warning. One thing I tried was for get_dialected_level() to self-mutate the LazyComparisonLevelFactory into a ComparisonLevel, but this seems not to be recommended in Python. I think it's probably good enough as-is.
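For reference, the warn-on-early-access behaviour described could be implemented with `__getattr__` roughly like this (a sketch with invented names, not the actual demo code):

```python
import warnings

# Hypothetical sketch of warning when a ComparisonLevel attribute is
# requested before a dialect is supplied - names are illustrative.


class ComparisonLevel:
    def __init__(self, level_dict):
        self.level_dict = level_dict


class LazyComparisonLevelFactory:
    def __init__(self, level_dict_fn):
        self.level_dict_fn = level_dict_fn

    def get_dialected_level(self, sql_dialect):
        return ComparisonLevel(self.level_dict_fn(sql_dialect))

    def __getattr__(self, name):
        # only called for attributes that don't exist on the factory,
        # i.e. things that belong to the dialected ComparisonLevel
        warnings.warn(
            f"{name!r} is only available once a dialect is set; "
            "call get_dialected_level(sql_dialect) first"
        )
        return None
```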

RobinL avatar Oct 28 '23 18:10 RobinL

I've had a second go at this using a class based approach to see if it can eliminate boilerplate, and I don't think it helps. Solution here for reference.

RobinL avatar Oct 30 '23 09:10 RobinL

I'm totally in favour of this - have been thinking about this for a while. I'll post some thoughts on other aspects related to slight API changes in a bit, but for the moment I thought I'd share my take on the comparisons aspect.

I have a very minimal sketch here. It's fairly similar to Robin's approach (particularly the second one), but I avoid the boilerplate by pushing the laziness all the way up to ComparisonLevel. This essentially means that they do not actually have an associated dialect at any point - it is only when they are used in the context of a Linker that they 'know' about a dialect, at which point we construct the appropriate SQL expression. I have had a look over the existing code and I haven't noticed anywhere that we really need the exact SQL when it is not associated with some linker, but do let me know if I am mistaken on that point!

In my version, additionally, nothing is ever simply a ComparisonLevel - anything we create directly from a custom SQL expression is a CustomLevel, which can have an associated dialect (if it is not universal). This could potentially allow us to leverage sqlglot's transpiling when it is used with a different dialect's linker.
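For instance (a sketch under assumed names, not the linked code, using sqlglot's `transpile`, which returns a list of converted statements):

```python
# Hypothetical sketch of a CustomLevel that transpiles its raw SQL when
# used with a different dialect's linker. Names are illustrative.


class CustomLevel:
    def __init__(self, sql_condition, sql_dialect=None):
        self.sql_condition = sql_condition
        self.sql_dialect = sql_dialect  # None means dialect-universal SQL

    def sql_for(self, target_dialect):
        if self.sql_dialect is None or self.sql_dialect == target_dialect:
            return self.sql_condition
        # deferred import: only needed when cross-dialect transpiling occurs
        import sqlglot

        return sqlglot.transpile(
            self.sql_condition, read=self.sql_dialect, write=target_dialect
        )[0]
```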

I have written this in terms of classes, but there is no particular reason it couldn't work via a function-based approach. However, I would advocate a class-based approach (but I realise that I may swing a bit OOPier than some!), as:

  • I personally don't think that there is too much overhead for contributors in doing things this way instead of constructing a conformant dict and creating a ComparisonLevel once we have stripped away almost all of the multiple-inheritance layers (though realise I am biased on this front).
  • I even think the resultant code may be a bit clearer, but that may be a matter of taste
  • most importantly, though, I think that it may sometimes be useful to know what type of comparison level we are dealing with (e.g. is it a levenshtein level of some kind?) which I think is cleaner via inheritance (rather than say storing an additional attribute on the ComparisonLevel). I'm thinking of things like #1210, but I can also imagine other situations where this may prove useful (perhaps exotic features like auto-tuning fuzzy match thresholds, or fancier validation where we could warn users if they get levels the wrong way round [like having levenshtein <= 2 before levenshtein <= 1] or use multiple different fuzzy levels)

Happy to discuss this further though!

ADBond avatar Oct 30 '23 15:10 ADBond

I think we're now in a good place: We have several different proposed solutions, all of which can successfully solve the core problem of getting rid of backend-specific comparison levels and associated complexity. Any of these solutions will be a big improvement.

We need a way of deciding which solution (or more likely what combination of elements) is best.

I'd like us to think about some criteria on which to make the decision.

Here are some thoughts:

  1. Ease of adding new comparison levels
  2. Functionality. Do we have specific examples of desirable functionality that may differ between implementations. For example, context aware comparison levels, ease of error messaging when e.g. a comparison level is not available for a dialect.
  3. Maintainability. e.g. a class based approach potentially allows for additional methods to be added for particularly tricky-to-construct comparisons
  4. Boilerplate/lines of code. e.g. what would a PR that added a new comparison_level look like? Is it confusing or clear? What errors may the user make?
  5. Ease of implementation (how much change do we need to make to Splink)

Any others?

RobinL avatar Oct 30 '23 17:10 RobinL

Here's another class based approach which tries to combine @ADBond 's approach and my approach.
https://github.com/RobinL/comparison_test/blob/e556a4b04e492daa7ff993428c27f3f693e75b45/splink/init.py

(In a real implementation, I think Andy's ideas around having further classes e.g. a dialect class could be useful, this was just to illustrate 'pushing up' the laziness to a parent class that isn't a ComparisonLevel)

It also occurred to me that it could make sense to split out the creation functions so that the user implements create_sql and create_label rather than create_level_dict. Here's an example:

https://github.com/RobinL/comparison_test/blob/c7a1dbd27ea3bae790f27277a7e0be475a4d24c4/splink/init.py
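The split described above could look roughly like this (illustrative names only, not the linked code):

```python
# Hypothetical sketch of splitting create_level_dict into create_sql and
# create_label - names are illustrative.


class ComparisonLevelCreator:
    def create_sql(self, sql_dialect):
        raise NotImplementedError

    def create_label(self):
        raise NotImplementedError

    def create_level_dict(self, sql_dialect):
        # the base class assembles the dict; subclasses only supply the parts
        return {
            "sql_condition": self.create_sql(sql_dialect),
            "label_for_charts": self.create_label(),
        }


class ExactMatchLevel(ComparisonLevelCreator):
    def __init__(self, col_name):
        self.col_name = col_name

    def create_sql(self, sql_dialect):
        # identical across dialects for an exact match
        return f"{self.col_name}_l = {self.col_name}_r"

    def create_label(self):
        return f"Exact match on {self.col_name}"
```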

RobinL avatar Oct 31 '23 08:10 RobinL

The more I think about it, the more I think it makes sense to separate ComparisonLevel (as the fancy objects Splink works directly with for computation) from the user-facing instructions for creating them. This is another approach of mine based on ☝️ those two from @RobinL. Tbh I'm not sure it adds a great deal to the discussion, but there are a couple of aspects I wanted to try out.

Partly it was separating the dialect further from the creator class so that it only contextually has access to a dialect (whilst keeping things as simple as possible for subclassers implementing any concrete builders). I also wanted to get a little further into how the Linker injects the dialect on a case-by-case basis, and how things work with InputColumn (although I have muddied the waters by putting some of that functionality on the Creator class).

If I'm honest I thought I had more of a point when I started doing it than I think I have by the end, but I'm leaving it here for reference in case there is something useful in it

ADBond avatar Oct 31 '23 18:10 ADBond

This is a slightly more fleshed out ComparisonLevelCreator that pulls in some of your ideas and implements .configure just to get a sense of what it would look like

RobinL avatar Nov 01 '23 08:11 RobinL

Y'all should look at how Ibis does this with generic Operations (e.g. Join, OrderBy, ArrayAgg, etc.), with specific backends then registering backend-specific implementations. They have some very mature and clean code to do this. Maybe it's over-engineered for your uses here, but at least useful for inspiration.

NickCrews avatar Nov 04 '23 03:11 NickCrews