Robin Linacre

91 issues by Robin Linacre

```
import duckdb
from splink import DuckDBAPI, Linker, SettingsCreator, splink_datasets

con = duckdb.connect()
db_api = DuckDBAPI(connection=con)
df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
)

linker = Linker(df, settings, db_api, input_table_aliases=["mytable"])
...
```

At the moment, if you pass in e.g. a pandas dataframe, the physical name of the table registered against the database system is always the same as the templated...
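For context, here is a minimal sketch of the registration step using plain duckdb rather than Splink's internals (the `con.register` call and the name `mytable` are illustrative stand-ins for whatever Splink does when it registers an input dataframe):

```
import duckdb
import pandas as pd

con = duckdb.connect()
df = pd.DataFrame({"first_name": ["John"], "surname": ["Smith"]})

# Register the dataframe under a physical name. The issue is about how this
# physical name is derived (always from the templated name, not the alias).
con.register("mytable", df)

print(con.sql("SELECT * FROM mytable").df())
```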

This fails in Spark:

```
r1 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": "1980-01-01",
}
r2 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": None,
}
linker.inference.compare_two_records(r1, r2).as_pandas_dataframe()
```

with

```
...
```

There's no reason we can't use the salt value to reduce the number of comparisons generated during EM training.
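A rough sketch of the idea in plain duckdb (not from the issue itself; the table name, the `__splink_salt` column, and the four-partition setup are all illustrative): requiring equal salts in the self-join cuts the candidate pairs by roughly the number of salt partitions.

```
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE input AS
    SELECT i AS unique_id, i % 4 AS __splink_salt  -- 4 salt partitions (illustrative)
    FROM range(1000) t(i)
""")

# Requiring l.salt = r.salt keeps roughly 1/4 of the pairs a plain self-join
# would emit - the kind of reduction the issue suggests exploiting in training.
n = con.sql("""
    SELECT count(*) FROM input l
    JOIN input r
      ON l.__splink_salt = r.__splink_salt
     AND l.unique_id < r.unique_id
""").fetchone()[0]
print(n)
```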

### Discussed in https://github.com/moj-analytical-services/splink/discussions/2713

Originally posted by **mashby1966** June 8, 2025

I have the date of birth in my dataset in YYYYMMDD (`%Y%m%d`) format and I have tried to use...
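For context, a minimal sketch of converting a `%Y%m%d` string into a proper date before handing it to Splink's date comparisons, using DuckDB's `strptime` (the table and column names are illustrative):

```
import duckdb

con = duckdb.connect()
# 'dob' held as a YYYYMMDD string, as in the discussion.
con.sql("CREATE TABLE people AS SELECT '19800101' AS dob")

# strptime parses the %Y%m%d string into a timestamp; cast to DATE to compare.
print(con.sql("SELECT strptime(dob, '%Y%m%d')::DATE AS dob FROM people").df())
```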

https://duckdb.org/docs/sql/query_syntax/with.html#recursive-ctes
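The link points at DuckDB's recursive CTE support; a minimal, self-contained example of the syntax (unrelated to Splink's actual queries):

```
import duckdb

# A recursive CTE counting 1..5, per the linked DuckDB docs.
print(duckdb.sql("""
    WITH RECURSIVE t(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM t WHERE n < 5
    )
    SELECT n FROM t
""").df())
```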


Very much a work in progress for now, but the overall approach here seems to work. See #2580. Closes #2580.

Currently, in Splink, comparison functions (e.g., `cosine_sim`) are evaluated multiple times within `CASE` statements during the `predict()` process. For example:

```sql
CASE
    WHEN cosine_sim(l, r) > 0.9 THEN 1
    WHEN ...
```
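One way to avoid the repeated evaluation is to compute the function once in a preceding projection and branch on the stored value. A sketch of that shape (not Splink's actual generated SQL), using DuckDB's built-in `jaccard` as a stand-in for `cosine_sim`, with made-up table and column names:

```
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE pairs AS SELECT 'john' AS name_l, 'jon' AS name_r")

# Evaluate the similarity once in a CTE, then reference the precomputed
# column in the CASE expression instead of re-calling it in every WHEN.
print(con.sql("""
    WITH scored AS (
        SELECT *, jaccard(name_l, name_r) AS sim FROM pairs
    )
    SELECT
        CASE
            WHEN sim > 0.9 THEN 2
            WHEN sim > 0.7 THEN 1
            ELSE 0
        END AS gamma_name
    FROM scored
""").df())
```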

It would be faster and more memory-efficient to train `u` probabilities comparison-by-comparison rather than doing them 'all in one' because:

- Doing them 'all in one' uses more memory....
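A sketch of the comparison-by-comparison idea in plain duckdb (everything here, including the exact-match gamma expression, is illustrative rather than Splink's real training code): instead of one wide query computing every comparison's match levels over the sampled pairs, run a small aggregation per comparison so only that comparison's columns are in flight at a time.

```
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE sampled_pairs AS
    SELECT 'john' AS first_name_l, 'jon' AS first_name_r,
           'smith' AS surname_l, 'smyth' AS surname_r
""")

# One pass per comparison: each query touches only that comparison's columns,
# so peak memory is bounded by a single comparison rather than all of them.
for col in ["first_name", "surname"]:
    u_counts = con.sql(f"""
        SELECT
            CASE WHEN {col}_l = {col}_r THEN 1 ELSE 0 END AS gamma,
            count(*) AS n
        FROM sampled_pairs
        GROUP BY 1
    """).df()
    print(col, u_counts, sep="\n")
```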