Robin Linacre
We round probabilities to 5 decimal places, meaning that no match probabilities above a match weight (mw) of about 17 appear on the chart.

Changes needed in `unlinkables.py`:

```
def unlinkables_data(linker: Linker) -> list[dict[str, Any]]:...
```
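For context, a minimal sketch (assuming the usual relationship probability = BF / (1 + BF), where BF = 2^match_weight) illustrates why 5dp rounding hides everything above a match weight of roughly 17; this is an illustration, not the chart code itself:

```
# Sketch only: why rounding probabilities to 5dp collapses everything
# above a match weight of ~17 into the same value.

def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 Bayes factor) to a match probability."""
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)

for mw in [15, 16, 17, 18, 25]:
    print(mw, round(match_weight_to_probability(mw), 5))
# 15 0.99997
# 16 0.99998
# 17 0.99999
# 18 1.0
# 25 1.0
```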
In the codebase we often have to be careful when checking the equality of two columns to ensure we're not comparing a quoted name to an unquoted one. I wondered whether it may...
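Purely to illustrate the quoting pitfall itself, here is a hypothetical sketch (the helper names are assumptions, not existing Splink functions) of normalising identifiers before comparing them:

```
# Hypothetical helpers: strip identifier quoting so "first_name" and
# first_name compare as equal column names.

def unquote_identifier(name: str) -> str:
    """Remove surrounding double quotes or backticks from a SQL identifier."""
    if len(name) >= 2 and name[0] == name[-1] and name[0] in ('"', "`"):
        return name[1:-1]
    return name

def columns_equal(a: str, b: str) -> bool:
    return unquote_identifier(a) == unquote_identifier(b)

assert columns_equal('"first_name"', "first_name")
assert not columns_equal('"first_name"', "surname")
```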
### Discussed in https://github.com/moj-analytical-services/splink/discussions/2526

Originally posted by **medwar99** November 27, 2024

### Is your proposal related to a problem?

The labelling tool shows the Splink predictions by default,...
============================== slowest durations ===============================
42.14s call tests/test_debug_mode.py::test_debug_mode_combined_training[spark]
21.49s call tests/test_debug_mode.py::test_debug_mode_ptrrm_train[spark]
20.24s call tests/test_analyse_blocking.py::test_analyse_blocking_slow_methodology[spark]
18.65s call tests/test_debug_mode.py::test_debug_mode_u_training[spark]
17.38s call tests/test_full_example_spark.py::test_full_example_spark
14.38s call tests/test_debug_mode.py::test_debug_mode_em_training[spark]
13.66s call tests/test_debug_mode.py::test_debug_mode_profile_columns[spark]
Following #2847, now that the implicit cache has been removed, give the user the ability to manage the cache explicitly.

## Blocked ID pairs

I have removed the `materialise_blocked_pairs: bool = True` flag on...
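As a rough illustration only, explicit cache management could look something like the sketch below; the class and method names (`PairwiseCache`, `materialise`, `invalidate`) are assumptions for the example, not Splink's actual API:

```
# Hypothetical sketch of user-controlled caching: tables are only built
# and stored when explicitly requested, and can be explicitly dropped.
from typing import Callable

class PairwiseCache:
    def __init__(self):
        self._tables: dict[str, object] = {}

    def materialise(self, key: str, build_table: Callable[[], object]) -> object:
        """Build and store a table the first time it is requested."""
        if key not in self._tables:
            self._tables[key] = build_table()
        return self._tables[key]

    def invalidate(self, key: str) -> None:
        """Drop a cached table so it is rebuilt on next use."""
        self._tables.pop(key, None)

cache = PairwiseCache()
blocked_pairs = cache.materialise("blocked_id_pairs", lambda: ["pairs table"])
cache.invalidate("blocked_id_pairs")  # explicit, user-controlled cleanup
```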
Following https://github.com/moj-analytical-services/splink/pull/2850

The only fiddly bit here was implementing clamping to avoid floating point errors. It's actually much easier when working with match weights. The key parts are where we...
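To show the general idea of the clamping (a sketch under the assumption that match weights are computed as log2(p / (1 - p)); the epsilon value and function name are illustrative, not taken from the codebase):

```
# Sketch: clamp probabilities away from 0 and 1 before taking log odds,
# so the resulting match weight is always finite.
import math

EPS = 1e-15  # illustrative clamp bound

def probability_to_match_weight(p: float) -> float:
    p = min(max(p, EPS), 1 - EPS)
    return math.log2(p / (1 - p))

print(probability_to_match_weight(1.0))  # finite, instead of +inf
print(probability_to_match_weight(0.0))  # finite, instead of -inf
```

Working in match weight space from the start sidesteps most of this, since summing log2 Bayes factors never produces the 0/1 edge cases that probabilities do.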
Following https://github.com/moj-analytical-services/splink/pull/2849

Add support for chunking and make sure it integrates correctly with caching/table management.
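A library-agnostic sketch of the chunking idea (names are illustrative assumptions; in the real pipeline each chunk would be scored and its intermediate table registered with, then dropped from, the cache):

```
# Sketch: process blocked ID pairs in fixed-size chunks.
from itertools import islice
from typing import Iterable, Iterator

def chunked(pairs: Iterable[tuple[str, str]], size: int) -> Iterator[list[tuple[str, str]]]:
    it = iter(pairs)
    while chunk := list(islice(it, size)):
        yield chunk

all_pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")]
for i, chunk in enumerate(chunked(all_pairs, size=2)):
    print(f"chunk {i}: {chunk}")
```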
Following https://github.com/moj-analytical-services/splink/pull/2848

Remove salting. Note: there's [a comment](https://github.com/moj-analytical-services/splink/blob/ea372bf667df3a1b8f5b413a50a59004b1c86a4a/splink/internals/settings.py#L646-L647) in the current code indicating that we need salting for duckdb to parallelise `linker.training.estimate_u_using_random_sampling`. I've double-checked and this is no longer...
- Do all calculations with additive match weights rather than multiplicative Bayes factors, see [here](https://github.com/moj-analytical-services/splink/issues/1889) (a sketch follows this list)
- Somehow use the `splink_udfs` duckdb extension
- Consider changing how real-time linking works...
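On the first bullet, a minimal sketch of why additive match weights are preferable: summing log2 Bayes factors keeps values in a small, well-behaved range, whereas multiplying Bayes factors directly can overflow or underflow when many comparison levels are combined. The Bayes factor values below are illustrative only:

```
import math

bayes_factors = [1e6, 1e-5, 2e4, 512.0, 1e7]

# multiplicative: product of Bayes factors, prone to overflow/underflow
product = math.prod(bayes_factors)

# additive: sum of match weights (log2 Bayes factors), numerically stable
total_match_weight = sum(math.log2(bf) for bf in bayes_factors)

assert math.isclose(math.log2(product), total_match_weight)
print(total_match_weight)  # roughly 49.9
```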
## Summary

The `_input_columns` method in `splink/internals/linker.py` (lines 186-245) is only used in one place and can be replaced with simpler existing code.

## Current Usage

The method is only...