Robin Linacre

234 comments of Robin Linacre

I just had a look at this. The first thing I tried was:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()
df...
```

The fix is that table registration should accept an Arrow row: https://github.com/moj-analytical-services/splink/blob/8b44ab58d39a798a443e1ec5ddef6149f072ace2/splink/internals/spark/database_api.py#L72

Actually, that's no good, because you can't pass Arrow directly to Spark.

```
from pyspark.sql.types import StructType, StructField, StringType

r1 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": None,
}
r2 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": "1980-01-01",
}

schema = StructType([...
```

The only reason you can't do that at the moment is that we add `[]` around the record! We should only do that if it's a dict: https://github.com/moj-analytical-services/splink/blob/8b44ab58d39a798a443e1ec5ddef6149f072ace2/splink/internals/linker_components/inference.py#L521

That should...

I applied a fix that allows two schema'd Spark DataFrames to be passed in to `compare_two_records`:

```
if isinstance(record_1, dict):
    record_1 = [record_1]
if isinstance(record_2, dict):
    record_2 = [record_2]
...
```

In Splink 4, the thing that changed is that blocking results in a pairwise table of records. That's probably the cause of the bug. It's a bit of a hassle, but...
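To illustrate what a "pairwise table" means here, a plain-Python sketch (not Splink's implementation): blocking on `first_name` keeps only candidate pairs that agree on that column, with the left and right records' columns suffixed `_l`/`_r` in each output row:

```python
import itertools

records = [
    {"id": 1, "first_name": "John", "surname": "Smith"},
    {"id": 2, "first_name": "John", "surname": "Smyth"},
    {"id": 3, "first_name": "Jane", "surname": "Smith"},
]

# Build the pairwise table: one row per candidate pair passing the
# blocking rule, holding both records' columns side by side.
pairs = [
    {**{f"{k}_l": v for k, v in a.items()},
     **{f"{k}_r": v for k, v in b.items()}}
    for a, b in itertools.combinations(records, 2)
    if a["first_name"] == b["first_name"]
]

# Only records 1 and 2 share a first_name, so one pairwise row results.
print(pairs)
```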

This now works:

```
from pyspark.sql.types import StructType, StructField, StringType

r1 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": None,
}
r2 = {
    "first_name": "John",
    "surname": "Smith",
    "dob": "1980-01-01",
}
...
```

Could be a useful script as a start for tests, but we first need to identify the cases where it breaks. Script to help understand debug mode to pinpoint issues: ```python...

Took a quick look and haven't got anywhere, but this is probably useful when we get more time to have a proper look. Script to help understand debug mode to...