splink
Add default postcode comparison function
Generate a case expression with 3-5 levels:
CASE
WHEN {full_match} THEN 4
WHEN {sector_match} THEN 3
WHEN {district_match} THEN 2
WHEN {area_match} THEN 1
ELSE 0
END
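A minimal sketch of how the four placeholders could be filled in. It assumes upper-case UK postcodes with a single space before the inward code (e.g. `SW1A 1AA`), input columns named `postcode_l` and `postcode_r`, and a SQL dialect that provides `regexp_extract(string, pattern, group)`. The helper name `postcode_case_expression` and the regexes are illustrative only and are not part of splink. Nulls and values that do not look like postcodes are not handled here.

```python
# Illustrative regexes (an assumption, not splink's own patterns).
AREA = "^[A-Z]{1,2}"                        # e.g. "SW"
DISTRICT = "^[A-Z]{1,2}[0-9][A-Z0-9]?"      # e.g. "SW1A"
SECTOR = "^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9]"  # e.g. "SW1A 1"


def postcode_case_expression(col="postcode"):
    """Build the CASE expression as a SQL string (hypothetical helper)."""
    left, right = f"{col}_l", f"{col}_r"

    def prefix_match(pattern):
        # Compare the same extracted prefix on both sides of the pair.
        return (
            f"regexp_extract({left}, '{pattern}', 0) = "
            f"regexp_extract({right}, '{pattern}', 0)"
        )

    return "\n".join([
        "CASE",
        f"WHEN {left} = {right} THEN 4",
        f"WHEN {prefix_match(SECTOR)} THEN 3",
        f"WHEN {prefix_match(DISTRICT)} THEN 2",
        f"WHEN {prefix_match(AREA)} THEN 1",
        "ELSE 0",
        "END",
    ])


print(postcode_case_expression())
```

The levels are ordered from most to least specific, so each pair lands in the first level it satisfies.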
Would something like this work, or does this need to be fully in SQL format?
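def PCmatch(pc_l, pc_r):
    # Count the leading characters that the two postcodes share.
    count = 0
    # Stop at the shorter string so unequal lengths cannot raise IndexError.
    while count < min(len(pc_l), len(pc_r)):
        if pc_l[count] == pc_r[count]:
            count += 1
        else:
            break
    return count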
Yeah, it needs to be in the form of a SQL case expression.
Have a look on Slack for a ScalaUDF solution. Or is there an easier way, and am I overcomplicating things?
Is that too naive? Wouldn't it be better to convert to a more general geographic indicator (rather than a postal area), like lat/lon, and then compare? With this sort of comparison, someone might have moved just one street away and still have an entirely different postcode.
There must be a lot of solutions to this already with UK gov codebases.
We have support for distance as the crow flies using lat and long input columns (not yet in the pypi version, will land in the next release): https://github.com/moj-analytical-services/splink/blob/d80b0c72a71ebb93dfb2aef76f5f1ceb458f7623/splink/comparison_level_library.py#L228
It can just sometimes be a bit of a faff to join on the geolocation to the input data, so this would provide a quick and dirty solution that would get 80% of the way there
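For reference, "distance as the crow flies" between two lat/long points is usually computed with the haversine formula. The sketch below is a plain-Python illustration, not splink's implementation (the linked comparison level may differ in detail). The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat_l, lon_l, lat_r, lon_r):
    """Great-circle distance in km between two points given in degrees."""
    phi_l, phi_r = radians(lat_l), radians(lat_r)
    d_phi = phi_r - phi_l
    d_lambda = radians(lon_r - lon_l)
    # Haversine of the central angle between the two points.
    a = sin(d_phi / 2) ** 2 + cos(phi_l) * cos(phi_r) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

A distance like this can then be bucketed into thresholds (for example within 1 km, within 10 km), in the same way the postcode levels above bucket by shared prefix.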
Could build on top of #1190
Closed by #1230