`linker.estimate_u_using_random_sampling` fails with default arguments, with no clear indication why
Splink 4:
```python
from splink import DuckDBAPI, Linker, SettingsCreator, splink_datasets
import splink.comparison_library as cl

settings = SettingsCreator("dedupe_only", [cl.ExactMatch("first_name"), cl.ExactMatch("surname")])
linker = Linker(splink_datasets.fake_1000, settings, DuckDBAPI())
linker.estimate_u_using_random_sampling()
```
Get the error: TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
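In the meantime, the error can be avoided by passing `max_pairs` explicitly rather than relying on the default. Continuing from the reproduction above (the value `1e6` here is just an illustrative choice, not a documented recommendation):

```python
# Workaround: set max_pairs explicitly so it is never None
linker.estimate_u_using_random_sampling(max_pairs=1e6)
```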
Some options for behaviour:
- `max_pairs = None` means we set proportion as 1, similar to if we have more `max_pairs` than possible record pairs (see the sketch after this list)
- ~~explicit error message if it is left as `None`, forcing user to set a non-default value~~ making this argument non-optional
- a directly valid default value
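To make the first option concrete, here is a rough sketch of how the sampling proportion could be derived; the function name and `count_of_possible_pairs` are illustrative, not Splink internals:

```python
def resolve_sample_proportion(max_pairs, count_of_possible_pairs):
    # Treat max_pairs=None as "sample everything", the same behaviour
    # as requesting more pairs than actually exist
    if max_pairs is None or max_pairs >= count_of_possible_pairs:
        return 1.0
    return max_pairs / count_of_possible_pairs
```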
Tricky, especially since it's unlikely the user will have an intuition for the 'correct'/'best' parameter (it wouldn't be very good if they set it to, like, 100 or something!)
I'd possibly set the default value to `1e6`, which will be quick to compute and will produce 'reasonably decent' estimates. And make sure that in examples/tutorials we always explicitly set it to sensible values.
My only concern with that is that we don't really have a mechanism for flagging that estimates may be unreliable - I'd worry some users might not look at the docs and just use the default, and then get (somewhat) unreliable estimates. I guess we could emit a warning if it's left at this value that they may want to increase the number for more serious estimation?
Agreed - I think a warning if it's left as the default should suffice
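As a sketch, the agreed behaviour (a `1e6` default plus a warning when the caller leaves it unset) could look something like the following; the signature and constant are illustrative, not the actual Splink implementation:

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_MAX_PAIRS = 1e6


def estimate_u_using_random_sampling(max_pairs=None):
    if max_pairs is None:
        max_pairs = DEFAULT_MAX_PAIRS
        logger.warning(
            "Estimating u-probabilities with the default max_pairs=1e6. "
            "Consider setting max_pairs to a larger value (e.g. 1e7 or 1e8) "
            "for more reliable estimates."
        )
    ...  # proceed with sampling up to max_pairs record pairs
```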