Robin Linacre
We round probabilities to 5 decimal places, meaning that no match probabilities above a match weight (mw) of about 17 appear on the chart.

Changes needed in `unlinkables.py`:

```
def unlinkables_data(linker: Linker) -> list[dict[str, Any]]:...
```
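For context, a minimal sketch (assuming the usual relationship probability = BF / (1 + BF), where BF = 2^match_weight) illustrates why 5dp rounding hides everything above a match weight of roughly 17; this is an illustration, not the chart code itself:

```
# Sketch only: why rounding probabilities to 5dp collapses everything
# above a match weight of ~17 into the same value.

def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 Bayes factor) to a match probability."""
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)

for mw in [15, 16, 17, 18, 25]:
    print(mw, round(match_weight_to_probability(mw), 5))
# 15 0.99997
# 16 0.99998
# 17 0.99999
# 18 1.0
# 25 1.0
```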
In the codebase we often have to be careful when checking the equality of two columns to ensure we're not comparing a quoted name to an unquoted one. I wondered whether it may...
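Purely to illustrate the quoting pitfall itself, here is a hypothetical sketch (the helper names are assumptions, not existing Splink functions) of normalising identifiers before comparing them:

```
# Hypothetical helpers: strip identifier quoting so "first_name" and
# first_name compare as equal column names.

def unquote_identifier(name: str) -> str:
    """Remove surrounding double quotes or backticks from a SQL identifier."""
    if len(name) >= 2 and name[0] == name[-1] and name[0] in ('"', "`"):
        return name[1:-1]
    return name

def columns_equal(a: str, b: str) -> bool:
    return unquote_identifier(a) == unquote_identifier(b)

assert columns_equal('"first_name"', "first_name")
assert not columns_equal('"first_name"', "surname")
```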
### Discussed in https://github.com/moj-analytical-services/splink/discussions/2526

Originally posted by **medwar99** November 27, 2024

### Is your proposal related to a problem?

The labelling tool shows the Splink predictions by default,...
============================== slowest durations ===============================
42.14s call tests/test_debug_mode.py::test_debug_mode_combined_training[spark]
21.49s call tests/test_debug_mode.py::test_debug_mode_ptrrm_train[spark]
20.24s call tests/test_analyse_blocking.py::test_analyse_blocking_slow_methodology[spark]
18.65s call tests/test_debug_mode.py::test_debug_mode_u_training[spark]
17.38s call tests/test_full_example_spark.py::test_full_example_spark
14.38s call tests/test_debug_mode.py::test_debug_mode_em_training[spark]
13.66s call tests/test_debug_mode.py::test_debug_mode_profile_columns[spark]
Following #2847, now that the implicit cache has been removed, give the user the ability to manage the cache explicitly.

## Blocked ID pairs

I have removed the `materialise_blocked_pairs: bool = True` flag on...
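As a rough illustration only, explicit cache management could look something like the sketch below; the class and method names (`PairwiseCache`, `materialise`, `invalidate`) are assumptions for the example, not Splink's actual API:

```
# Hypothetical sketch of user-controlled caching: tables are only built
# and stored when explicitly requested, and can be explicitly dropped.
from typing import Callable

class PairwiseCache:
    def __init__(self):
        self._tables: dict[str, object] = {}

    def materialise(self, key: str, build_table: Callable[[], object]) -> object:
        """Build and store a table the first time it is requested."""
        if key not in self._tables:
            self._tables[key] = build_table()
        return self._tables[key]

    def invalidate(self, key: str) -> None:
        """Drop a cached table so it is rebuilt on next use."""
        self._tables.pop(key, None)

cache = PairwiseCache()
blocked_pairs = cache.materialise("blocked_id_pairs", lambda: ["pairs table"])
cache.invalidate("blocked_id_pairs")  # explicit, user-controlled cleanup
```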
Following https://github.com/moj-analytical-services/splink/pull/2850

The only fiddly bit here was implementing clamping to avoid floating point errors. It's actually much easier when working with match weights. The key parts are where we...
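To show the general idea of the clamping (a sketch under the assumption that match weights are computed as log2(p / (1 - p)); the epsilon value and function name are illustrative, not taken from the codebase):

```
# Sketch: clamp probabilities away from 0 and 1 before taking log odds,
# so the resulting match weight is always finite.
import math

EPS = 1e-15  # illustrative clamp bound

def probability_to_match_weight(p: float) -> float:
    p = min(max(p, EPS), 1 - EPS)
    return math.log2(p / (1 - p))

print(probability_to_match_weight(1.0))  # finite, instead of +inf
print(probability_to_match_weight(0.0))  # finite, instead of -inf
```

Working in match weight space from the start sidesteps most of this, since summing log2 Bayes factors never produces the 0/1 edge cases that probabilities do.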
Following https://github.com/moj-analytical-services/splink/pull/2849

Add support for chunking and make sure it integrates correctly with caching/table management.
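A library-agnostic sketch of the chunking idea (names are illustrative assumptions; in the real pipeline each chunk would be scored and its intermediate table registered with, then dropped from, the cache):

```
# Sketch: process blocked ID pairs in fixed-size chunks.
from itertools import islice
from typing import Iterable, Iterator

def chunked(pairs: Iterable[tuple[str, str]], size: int) -> Iterator[list[tuple[str, str]]]:
    it = iter(pairs)
    while chunk := list(islice(it, size)):
        yield chunk

all_pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")]
for i, chunk in enumerate(chunked(all_pairs, size=2)):
    print(f"chunk {i}: {chunk}")
```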
Following https://github.com/moj-analytical-services/splink/pull/2848

Remove salting. Note: there's [a comment](https://github.com/moj-analytical-services/splink/blob/ea372bf667df3a1b8f5b413a50a59004b1c86a4a/splink/internals/settings.py#L646-L647) in the current code indicating that we need salting for duckdb to parallelise `linker.training.estimate_u_using_random_sampling`. I've double-checked and this is no longer...
- Do all calculations with additive match weights rather than multiplicative Bayes factors, see [here](https://github.com/moj-analytical-services/splink/issues/1889) (a sketch follows this list)
- Somehow use the `splink_udfs` duckdb extension
- Consider changing how real-time linking works...
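On the first bullet, a minimal sketch of why additive match weights are preferable: summing log2 Bayes factors keeps values in a small, well-behaved range, whereas multiplying Bayes factors directly can overflow or underflow when many comparison levels are combined. The Bayes factor values below are illustrative only:

```
import math

bayes_factors = [1e6, 1e-5, 2e4, 512.0, 1e7]

# multiplicative: product of Bayes factors, prone to overflow/underflow
product = math.prod(bayes_factors)

# additive: sum of match weights (log2 Bayes factors), numerically stable
total_match_weight = sum(math.log2(bf) for bf in bayes_factors)

assert math.isclose(math.log2(product), total_match_weight)
print(total_match_weight)  # roughly 49.9
```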
## Summary

The `_input_columns` method in `splink/internals/linker.py` (lines 186-245) is only used in one place and can be replaced with simpler existing code.

## Current Usage

The method is only...