address-matching
Python script for matching a list of messy addresses against a gazetteer using dedupe. It can also serve as a rough geocoder if your gazetteer includes latitude and longitude for each address.
Part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data.
Setup
These steps get the script working even if you do not already have dedupe installed.
git clone git@github.com:datamade/address-matching.git
cd address-matching
pip install "numpy>=1.6"
pip install -r requirements.txt
Gazetteer
You will need a Gazetteer of all unique addresses in a given area. For this example, we used the Cook County Address Point shapefile.
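As a rough illustration of what "all unique addresses" means in practice, the sketch below drops repeated addresses from a CSV export before it is used as a gazetteer. It assumes the address points have already been exported from the shapefile to CSV and that the address lives in a column named `address`; both the column name and the helper functions are hypothetical and not part of this repository.

```python
import csv


def normalize(address):
    """Collapse case and whitespace so trivially different spellings compare equal."""
    return " ".join(address.upper().split())


def unique_addresses(rows, field="address"):
    """Keep the first row seen for each normalized address.

    `field` is a hypothetical column name; use whatever your source file calls it.
    Rows with an empty address are skipped.
    """
    seen = set()
    unique = []
    for row in rows:
        key = normalize(row[field])
        if key and key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def write_gazetteer(source_path, gazetteer_path, field="address"):
    """Read a CSV export, drop repeated addresses, and write the result."""
    with open(source_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = unique_addresses(reader, field)
    with open(gazetteer_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

This only catches exact repeats after trimming case and spacing. Fuzzy variants such as "St" versus "Street" are what dedupe itself is for.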
List of addresses you want to match
This program takes a list of addresses and matches them to individual records in the Gazetteer. For this example, we are using a messy list of early childhood education locations in Chicago. This file can have multiple entries referring to the same place.
Usage
Once you have a Gazetteer and a messy input file, run address_matching.py:
python address_matching.py
You will be prompted to label some training pairs so that dedupe can learn which addresses refer to the same place. More on this here.
The output will be saved to address_matching_output.csv
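If your Gazetteer carries coordinates, the matched output can be turned into a rough geocoding result by looking up each matched record's latitude and longitude. The sketch below shows one way to do that join. Every column name in it (`gazetteer_id`, `id`, `latitude`, `longitude`) is an assumption made for illustration; check the header of address_matching_output.csv and of your Gazetteer for the real names before using it.

```python
def attach_coordinates(matches, gazetteer, match_key="gazetteer_id",
                       gazetteer_key="id", lat="latitude", lon="longitude"):
    """Copy lat/long from the gazetteer onto each matched row.

    All key and column names are hypothetical defaults. Rows without a
    match in the gazetteer get None for both coordinates.
    """
    # Index gazetteer rows by their identifier for constant-time lookup.
    by_id = {row[gazetteer_key]: row for row in gazetteer}

    geocoded = []
    for match in matches:
        hit = by_id.get(match.get(match_key))
        out = dict(match)  # leave the input rows untouched
        out[lat] = hit[lat] if hit else None
        out[lon] = hit[lon] if hit else None
        geocoded.append(out)
    return geocoded
```

Both arguments are lists of dictionaries, such as those produced by reading each file with `csv.DictReader`.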