Add default postcode comparison function

Open samnlindsay opened this issue 3 years ago • 5 comments

Generate a case expression with 3-5 levels:

CASE 
WHEN {full_match} THEN 4
WHEN {sector_match} THEN 3
WHEN {district_match} THEN 2
WHEN {area_match} THEN 1
ELSE 0 
END
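
As a minimal sketch of how the placeholders above might be filled in, the following builds the five-level version as a SQL string. It assumes postcodes are already upper-cased and normalised to a single space before the three-character inward code (e.g. `SW1A 2AA`), and that the two columns carry `_l` and `_r` suffixes. The names `postcode_case_expression` and `level_for` are hypothetical and not part of Splink. The SQL uses only `substr` and `length` and is exercised here against SQLite, so other backends may need small changes.

```python
import sqlite3


def postcode_case_expression(col: str = "postcode") -> str:
    """Build a 5-level CASE expression comparing {col}_l with {col}_r.

    Assumes normalised UK postcodes such as 'SW1A 2AA' (hypothetical helper).
    """
    left, right = f"{col}_l", f"{col}_r"

    def sector(c: str) -> str:
        # Everything except the last two characters, e.g. 'SW1A 2'
        return f"substr({c}, 1, length({c}) - 2)"

    def district(c: str) -> str:
        # Outward code: everything before the space, e.g. 'SW1A'
        return f"substr({c}, 1, length({c}) - 4)"

    def area(c: str) -> str:
        # Leading one or two letters, e.g. 'SW' or 'M'
        return (
            f"(CASE WHEN substr({c}, 2, 1) BETWEEN 'A' AND 'Z' "
            f"THEN substr({c}, 1, 2) ELSE substr({c}, 1, 1) END)"
        )

    return "\n".join(
        [
            "CASE",
            f"WHEN {left} = {right} THEN 4",
            f"WHEN {sector(left)} = {sector(right)} THEN 3",
            f"WHEN {district(left)} = {district(right)} THEN 2",
            f"WHEN {area(left)} = {area(right)} THEN 1",
            "ELSE 0",
            "END",
        ]
    )


def level_for(pc_l: str, pc_r: str) -> int:
    """Evaluate the expression for one pair in an in-memory SQLite database."""
    sql = (
        f"SELECT {postcode_case_expression()} "
        "FROM (SELECT ? AS postcode_l, ? AS postcode_r)"
    )
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql, (pc_l, pc_r)).fetchone()[0]
    finally:
        conn.close()
```

For example, `level_for("SW1A 2AA", "SW1A 2AB")` falls into the sector level, while `level_for("SW1A 2AA", "M1 1AE")` falls through to `ELSE 0`. Null handling is left out of this sketch; a null on either side also falls through to `ELSE 0`.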

samnlindsay avatar Oct 12 '21 11:10 samnlindsay

Would something like this work, or does it need to be written entirely in SQL?

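def PCmatch(pc_l, pc_r):
    """Return the number of leading characters the two postcodes share."""
    count = 0
    # zip stops at the shorter string, so a shorter pc_r cannot raise IndexError
    for char_l, char_r in zip(pc_l, pc_r):
        if char_l != char_r:
            break
        count += 1
    return count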

mamonu avatar Oct 26 '21 23:10 mamonu

Yeah, needs to be in the form of a SQL case expression

RobinL avatar Nov 11 '21 07:11 RobinL

Have a look on Slack for a Scala UDF solution. Or is there an easier way, and am I overcomplicating things?

mamonu avatar Nov 11 '21 13:11 mamonu

Is that too naive? Wouldn't it be better to convert to a more general geographic indicator (rather than a postal area), like lat/lon, and then compare? With this sort of comparison you might have moved one street away but have an entirely different postcode.

There must be a lot of solutions to this already with UK gov codebases.

pbhj avatar Sep 05 '22 13:09 pbhj

We have support for distance as the crow flies using lat and long input columns (not yet in the pypi version, will land in the next release): https://github.com/moj-analytical-services/splink/blob/d80b0c72a71ebb93dfb2aef76f5f1ceb458f7623/splink/comparison_level_library.py#L228

It can just sometimes be a bit of a faff to join the geolocation onto the input data, so this would provide a quick and dirty solution that would get 80% of the way there.
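
For illustration, "distance as the crow flies" between two lat/long pairs is commonly computed with the haversine formula. The sketch below is a plain-Python version, not the implementation at the link above; the function name `haversine_km`, the argument names, and the mean Earth radius of 6371 km are assumptions of this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat_l: float, lon_l: float, lat_r: float, lon_r: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi_l, phi_r = radians(lat_l), radians(lat_r)
    d_phi = phi_r - phi_l
    d_lambda = radians(lon_r - lon_l)
    a = sin(d_phi / 2) ** 2 + cos(phi_l) * cos(phi_r) * sin(d_lambda / 2) ** 2
    # Clamp to 1 so floating-point rounding cannot push asin out of its domain
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def within_km(lat_l: float, lon_l: float, lat_r: float, lon_r: float, km: float) -> bool:
    """True when the two points are no more than `km` apart (hypothetical helper)."""
    return haversine_km(lat_l, lon_l, lat_r, lon_r) <= km
```

A comparison level could then be a threshold such as `within_km(lat_l, lon_l, lat_r, lon_r, 1.0)`, which sidesteps the case where neighbouring streets have entirely different postcodes.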

RobinL avatar Sep 05 '22 13:09 RobinL

Could build on top of #1190

RossKen avatar Apr 14 '23 16:04 RossKen

Closed by #1230

RossKen avatar May 21 '23 00:05 RossKen