Robin Linacre

91 issues by Robin Linacre

```
import duckdb
from splink import DuckDBAPI, Linker, SettingsCreator, splink_datasets

con = duckdb.connect()
db_api = DuckDBAPI(connection=con)
df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
)

linker = Linker(df, settings, db_api, input_table_aliases=["mytable"])
...
```

At the moment, if you pass in e.g. a pandas dataframe, the physical name of the table registered against the database system is always the same as the templated...
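For context, here is a minimal sketch of the registration step using plain duckdb rather than Splink's internals (the `con.register` call and the name `mytable` are illustrative stand-ins for whatever Splink does when it registers an input dataframe):

```
import duckdb
import pandas as pd

con = duckdb.connect()
df = pd.DataFrame({"first_name": ["John"], "surname": ["Smith"]})

# Register the dataframe under a physical name. The issue is about how this
# physical name is derived (always from the templated name, not the alias).
con.register("mytable", df)

print(con.sql("SELECT * FROM mytable").df())
```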

This fails in Spark:

```
r1 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": "1980-01-01",
}
r2 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": None,
}
linker.inference.compare_two_records(r1, r2).as_pandas_dataframe()
```

with

```
...
```

There's no reason we can't use the salt value to reduce the number of comparisons generated during EM training.
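A rough sketch of the idea in plain duckdb (not from the issue itself; the table name, the `__splink_salt` column, and the four-partition setup are all illustrative): requiring equal salts in the self-join cuts the candidate pairs by roughly the number of salt partitions.

```
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE input AS
    SELECT i AS unique_id, i % 4 AS __splink_salt  -- 4 salt partitions (illustrative)
    FROM range(1000) t(i)
""")

# Requiring l.salt = r.salt keeps roughly 1/4 of the pairs a plain self-join
# would emit - the kind of reduction the issue suggests exploiting in training.
n = con.sql("""
    SELECT count(*) FROM input l
    JOIN input r
      ON l.__splink_salt = r.__splink_salt
     AND l.unique_id < r.unique_id
""").fetchone()[0]
print(n)
```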

### Discussed in https://github.com/moj-analytical-services/splink/discussions/2713

Originally posted by **mashby1966** June 8, 2025

I have the date of birth in my dataset in YYYYMMDD (`%Y%m%d`) format and I have tried to use...
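For context, a minimal sketch of converting a `%Y%m%d` string into a proper date before handing it to Splink's date comparisons, using DuckDB's `strptime` (the table and column names are illustrative):

```
import duckdb

con = duckdb.connect()
# 'dob' held as a YYYYMMDD string, as in the discussion.
con.sql("CREATE TABLE people AS SELECT '19800101' AS dob")

# strptime parses the %Y%m%d string into a timestamp; cast to DATE to compare.
print(con.sql("SELECT strptime(dob, '%Y%m%d')::DATE AS dob FROM people").df())
```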

https://duckdb.org/docs/sql/query_syntax/with.html#recursive-ctes
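The link points at DuckDB's recursive CTE support; a minimal, self-contained example of the syntax (unrelated to Splink's actual queries):

```
import duckdb

# A recursive CTE counting 1..5, per the linked DuckDB docs.
print(duckdb.sql("""
    WITH RECURSIVE t(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM t WHERE n < 5
    )
    SELECT n FROM t
""").df())
```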


Very much a work in progress for now, but the overall approach here seems to work. See #2580. Closes #2580.

Currently, in Splink, comparison functions (e.g., `cosine_sim`) are evaluated multiple times within `CASE` statements during the `predict()` process. For example:

```sql
CASE
    WHEN cosine_sim(l, r) > 0.9 THEN 1
    WHEN ...
```
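One way to avoid the repeated evaluation is to compute the function once in a preceding projection and branch on the stored value. A sketch of that shape (not Splink's actual generated SQL), using DuckDB's built-in `jaccard` as a stand-in for `cosine_sim`, with made-up table and column names:

```
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE pairs AS SELECT 'john' AS name_l, 'jon' AS name_r")

# Evaluate the similarity once in a CTE, then reference the precomputed
# column in the CASE expression instead of re-calling it in every WHEN.
print(con.sql("""
    WITH scored AS (
        SELECT *, jaccard(name_l, name_r) AS sim FROM pairs
    )
    SELECT
        CASE
            WHEN sim > 0.9 THEN 2
            WHEN sim > 0.7 THEN 1
            ELSE 0
        END AS gamma_name
    FROM scored
""").df())
```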

It would be faster and more memory-efficient to train `u` probabilities comparison-by-comparison rather than doing them 'all in one' because:

- Doing them 'all in one' uses more memory....
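A sketch of the comparison-by-comparison idea in plain duckdb (everything here, including the exact-match gamma expression, is illustrative rather than Splink's real training code): instead of one wide query computing every comparison's match levels over the sampled pairs, run a small aggregation per comparison so only that comparison's columns are in flight at a time.

```
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE sampled_pairs AS
    SELECT 'john' AS first_name_l, 'jon' AS first_name_r,
           'smith' AS surname_l, 'smyth' AS surname_r
""")

# One pass per comparison: each query touches only that comparison's columns,
# so peak memory is bounded by a single comparison rather than all of them.
for col in ["first_name", "surname"]:
    u_counts = con.sql(f"""
        SELECT
            CASE WHEN {col}_l = {col}_r THEN 1 ELSE 0 END AS gamma,
            count(*) AS n
        FROM sampled_pairs
        GROUP BY 1
    """).df()
    print(col, u_counts, sep="\n")
```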