splink icon indicating copy to clipboard operation
splink copied to clipboard

Splink4: How to enable comparison_levels with multiple input columns

Open RobinL opened this issue 1 year ago • 3 comments

Problem

At the moment we have

class ComparisonLevelCreator(ABC):
    def __init__(self, col_name: str = None):
        """
        Class to author ComparisonLevels
        Args:
            col_name (str): Input column name
        """
        self.col_name = col_name

This allows us to automatically expose an input_column method:

    @final
    def input_column(self, sql_dialect: SplinkDialect) -> InputColumn:
        return InputColumn(self.col_name, sql_dialect=sql_dialect.sqlglot_name)

so that in a child (e.g.LevenshteinLevel) they can write something like:

def create_sql(self, sql_dialect: SplinkDialect) -> str:
        col = self.input_column(sql_dialect)
        lev_fn = sql_dialect.levenshtein_function_name
        return f"{lev_fn}({col.name_l()}, {col.name_r()}) <= {self.distance_threshold}"

How can we extend this pattern to allow multiple col_names, e.g. for a ColumnsReveresedLevel

Discussion

I have tried writing a decorator like this:

class ColumnsReveresedLevel(ComparisonLevelCreator):
    @transform_strings("my_arg_1", "my_arg_2")
    def __init__(my_arg_1, my_arg_2, other_arg):

That would automatically convert the args into InputColumn factories e.g. self.my_arg_1(dialect)

it turns out this is a little tricky.

But the more I think about it, you still have to write:

def create_sql(self, sql_dialect: SplinkDialect) -> str:
        my_arg_1_input_col = self.my_arg_1(sql_dialect):

Is that really any better than simply:

class ColumnsReveresedLevel(ComparisonLevelCreator):
    def __init__(my_arg_1, my_arg_2, other_arg):
        self.my_arg_1 = my_arg_1
        self.my_arg_2 = my_arg_2
        etc.

    def create_sql(self, sql_dialect):
        my_arg_1_input_col = InputColumn(self.my_arg_1,sql_dialect)

It's not obvious to me there's any way of dramatically simplifying this for the user

So my proposal is simply that we get rid of the existing mechanism of setting self.col_name = col_name and exposing the input_column method and instead the user simply uses InputColumn directly

This actually has one advantage: Since the user will always have to use InputColumn to properly create their level (to robustly produce the _l , _r etc. irrespective of need for escaping etc), then i think it's better for htem to know their using it explicitly

RobinL avatar Nov 11 '23 15:11 RobinL

I think this makes sense - not sure I was too convinced at any point that we needed col_name + input_column in general, especially as it only really fits for one-column levels.

As you say I think the fact is that users need to know (at least a little) about how InputColumn works anyway to form the SQL string, so why not just get them to use this.

There is also maybe something conceptually nice about not having to keep any data in the base class - each subclass ca just define what it needs, and not have to worry about passing things to super()__init__().

ADBond avatar Nov 13 '23 11:11 ADBond

However maybe there is an argument that we should have people use an intermediate class instead? InputColumn has a lot of stuff that goes with it (which is needed by the Linker), but which people writing comparison levels almost certainly shouldn't be directly messing around with - I worry that people might get a bit lost in all the methods when they basically only need name_l + name_r.

It would also mean we are more free to mess around with InputColumn without worrying so much about user-impact - in the same way that we are aiming to do with Comparison and ComparisonLevel

ADBond avatar Nov 13 '23 12:11 ADBond

I agree with both points (avoiding super(), and there maybe should be an intermediate class)

As @ThomasHepworth has just pointed out, there are potential additions to InputColumn such as to_date, regex extract, lower, that make it even more complex (and hence harder to use for end user).

RobinL avatar Nov 13 '23 12:11 RobinL