
[FEAT] Improved docs on estimate_probability_two_random_records_match

Open sama-ds opened this issue 1 year ago • 1 comments

Is your proposal related to a problem?

Currently, the only documentation for this function is its docstring. While the docstring defines what needs to go into the function, it does not explain why these parameters are needed, what the function does "under the hood", or how the result affects the wider model through the prior.

Describe the solution you'd like

A full description of:

  • What deterministic rules are designed to do
  • What the recall means in relation to the deterministic rule
  • How these two are combined mathematically to calculate the probability
  • How the probability leads to the calculation of the prior
  • How having a "bad" set of deterministic rules may affect your prior (highlighting the importance of this)
  • How increasing/decreasing the recall affects the prior (highlighting the relatively lesser importance of this)
  • Why having an accurate prior is important

This could sit with the function directly, or, as these are blocking rules, it could sit as an additional section within the blocking rules topic guide.
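As a rough illustration of what such a description might walk through, here is a minimal sketch of the calculation as I understand it. It assumes a deduplication job, so the number of possible comparisons is n(n-1)/2, and it assumes the prior match weight is the log2 odds of the probability. The helper names `estimate_prior` and `prior_match_weight` and all the numbers are hypothetical and are not part of Splink's API.

```python
import math


def estimate_prior(n_pairs_from_rules: int, recall: float, n_records: int) -> float:
    """Hypothetical helper: estimate P(two random records match).

    n_pairs_from_rules: record pairs returned by the deterministic rules
    recall: assumed share of all true matches that the rules find
    n_records: records in the (single) input dataset
    """
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    # Assumption: dedupe-only linkage, so every unordered pair is a comparison.
    total_comparisons = n_records * (n_records - 1) / 2
    # Scale the pairs found up to an estimate of all true matches.
    estimated_true_matches = n_pairs_from_rules / recall
    return estimated_true_matches / total_comparisons


def prior_match_weight(p: float) -> float:
    """Assumed form of the prior: log2 of the odds of a match."""
    return math.log2(p / (1 - p))


# Made-up inputs, purely for illustration.
baseline = estimate_prior(n_pairs_from_rules=350, recall=0.7, n_records=1000)
# Rules that are too loose and return twice as many pairs.
loose_rules = estimate_prior(n_pairs_from_rules=700, recall=0.7, n_records=1000)
# Same rules, but the user guesses a higher recall.
higher_recall = estimate_prior(n_pairs_from_rules=350, recall=0.8, n_records=1000)

for label, p in [
    ("baseline", baseline),
    ("loose rules", loose_rules),
    ("higher recall", higher_recall),
]:
    print(f"{label}: p={p:.6f}, prior match weight={prior_match_weight(p):.2f}")
```

In this sketch the probability is proportional to the number of pairs the rules return and inversely proportional to the recall, so rules that return twice as many pairs double the estimate, while moving the recall guess from 0.7 to 0.8 only scales it by 0.875. That is the kind of comparison the docs could use to show why the choice of rules matters more than the exact recall value.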

sama-ds, Oct 06 '23 10:10

FWIW, here is the discussion of why we use this approach; it may contain useful information for the docs: https://github.com/moj-analytical-services/splink/issues/462

RobinL, Oct 06 '23 19:10