
Ran address matching, here's a report

Open jpvelez opened this issue 10 years ago • 6 comments

Just ran dedupe on my dataset!

Learning

During the learning step, I labeled 212 examples. Here's what the labels ended up being:

yes: 1, no: 204, unsure: 7

The "yes" example came up in the first three, then there were no more yeses.

After about 30 "no"s, I got a rash of address comparisons where the street number was off by a digit or two, e.g. 5136 S Tripp Ave and 5135 S Tripp Ave.

I wasn't sure whether to research these addresses in order to answer truthfully, because as many of them might be the same building as not. (I now realize that's an assumption that maybe dedupe could have dealt with, by outputting the right "rate" of these kinds of guesses.) So I labeled them all "unsure." Then they stopped appearing, and I labeled hundreds of "no"s.
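For illustration, near-miss pairs like 5136 S Tripp Ave and 5135 S Tripp Ave could be flagged up front rather than labeled one at a time. This is only a sketch and not part of dedupe: `split_address` and `is_near_miss` are hypothetical helpers, and it assumes that the street number is the leading run of digits and that "off by a digit or two" means a numeric gap of 1 or 2.

```python
import re

def split_address(address):
    """Split '5136 S Tripp Ave' into (5136, 's tripp ave').

    Hypothetical helper: returns (None, text) when no leading number is found.
    """
    match = re.match(r"\s*(\d+)\s+(.*)", address)
    if match is None:
        return None, address.strip().lower()
    return int(match.group(1)), match.group(2).strip().lower()

def is_near_miss(a, b, max_gap=2):
    """True when the street text matches and the numbers differ by 1..max_gap."""
    num_a, street_a = split_address(a)
    num_b, street_b = split_address(b)
    if num_a is None or num_b is None or street_a != street_b:
        return False
    return 0 < abs(num_a - num_b) <= max_gap

print(is_near_miss("5136 S Tripp Ave", "5135 S Tripp Ave"))
```

Pairs flagged this way could be set aside and researched in one batch, so the answers given during labeling stay truthful.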

Clustering

Total running time: 2:30

Here's the shell output:

INFO:dedupe.api:3 folds
INFO:dedupe.crossvalidation:using cross validation to find optimum alpha...
INFO:dedupe.crossvalidation:optimum alpha: 1.000000
INFO:dedupe.api:Learned Weights
INFO:dedupe.api:('address', -0.03654640167951584)
INFO:dedupe.api:('bias', -2.899808406829834)
INFO:dedupe.blocking:Calculating coverage of simple predicates
INFO:dedupe.blocking:Calculating coverage of tf-idf predicates
INFO:dedupe.blocking:defaultdict(<type 'set'>, {})
INFO:dedupe.tfidf:Canopy: TF-IDF:0.4address
INFO:dedupe.tfidf:Canopy: TF-IDF:0.6address
INFO:dedupe.tfidf:Canopy: TF-IDF:0.2address
INFO:dedupe.tfidf:Canopy: TF-IDF:0.8address
INFO:dedupe.blocking:coverage threshold: 32207
INFO:dedupe.blocking:Before removing liberal predicates, 13 predicates
INFO:dedupe.blocking:After removing liberal predicates, 13 predicates
INFO:dedupe.blocking:Final predicate set:
INFO:dedupe.blocking:[('wholeFieldPredicate', 'address')]
INFO:dedupe.blocking:defaultdict(<type 'set'>, {'address': set(['ave', 's', 'st', 'w', 'n'])})
INFO:dedupe.blocking:0, 0.0000812 seconds

...

INFO:dedupe.api:Maximum expected recall and precision
INFO:dedupe.api:recall: 1.000
INFO:dedupe.api:precision: 0.051
INFO:dedupe.api:With threshold: 0.051
clustering...
duplicate sets 368645
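One way to read the learned weights, assuming they feed a logistic model (the log itself does not say how the score is formed): the bias alone gives a match probability of roughly 0.05, and the small address weight barely moves it. The sketch below uses only the two weights from the log; `match_probability` and the distance values passed to it are hypothetical.

```python
import math

# Weights copied from the "Learned Weights" lines in the log above.
ADDRESS_WEIGHT = -0.03654640167951584
BIAS = -2.899808406829834

def match_probability(address_distance):
    """Logistic score for one pair; the logistic form is an assumption."""
    z = BIAS + ADDRESS_WEIGHT * address_distance
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical distance values, just to show how little the score moves.
for distance in (0.0, 1.0, 5.0):
    print(distance, round(match_probability(distance), 3))
```

If that assumption holds, nearly every pair would score close to the reported threshold, which may be worth keeping in mind when reading the recall and precision figures.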

Accuracy test

As a quickie accuracy test, I eyeballed the first hundred records of the output csv.

All of the first 100 records were identical address matches.

So my sample precision is 100%. Who knows what the sample recall is.
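The eyeball check can be written down as a small calculation. This is a sketch: `sample_precision` and `rule_of_three_upper_error` are hypothetical helpers, and the rule-of-three bound assumes the checked records are a random sample, which the first hundred rows of a file may not be.

```python
def sample_precision(correct, checked):
    """Fraction of checked records that were true matches."""
    return correct / checked

def rule_of_three_upper_error(checked):
    """Approximate 95% upper bound on the error rate when zero errors were seen."""
    return 3.0 / checked

print(sample_precision(100, 100))
print(rule_of_three_upper_error(100))
```

Recall cannot be estimated the same way, since that would require knowing which true matches are missing from the output.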

Next steps

@fgregg @derekeder, any thoughts on what this means?

Did this perform as expected, given that we were comparing on a single address field?

Next, I'm going to measure how many buildings with no building age data had a match (positive or not).

jpvelez · Mar 26 '14 14:03