cjworkbench
County name cleanup module
It would be very useful, for local data journalism, to have a module that cleans US county names and looks up their FIPS codes. Attached is a mockup of what this might look like.
I have worked on something similar before. A human-in-the-loop system can solve this problem. Think of it as a repository of all the different misspellings that can occur for a county name. The model (along with human intervention) matches these faulty spellings based on context (state name, area code, etc.). Once people start using it, the faulty spellings can be cached in the system to make the process faster.
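A minimal sketch of the idea above, not the eventual module: the three-county table, the `confirmed` cache, and the `lookup`/`confirm` helpers are all hypothetical names made up for illustration, and the standard library's `difflib` stands in for whatever fuzzy matcher is actually used. Candidates are restricted to the given state (the "context"), and a fuzzy hit is flagged for human review until someone confirms it.

```python
from difflib import SequenceMatcher

# Tiny illustrative sample: (state, county name) -> FIPS code.
# A real module would load the full list of US counties.
COUNTIES = {
    ("MD", "baltimore"): "24005",
    ("GA", "brooks"): "13027",
    ("CA", "modoc"): "06049",
}

# Misspellings a human has already confirmed: (state, typed) -> canonical name.
confirmed = {}


def similarity(a, b):
    """Similarity score from 0 to 100, where 100 means identical strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def lookup(state, county, min_score=80):
    """Return (fips, status); fips is None when nothing is close enough."""
    key = (state, county.strip().lower())
    if key in COUNTIES:
        return COUNTIES[key], "exact"
    if key in confirmed:
        return COUNTIES[(state, confirmed[key])], "cached"
    # Use the state as context: only compare against counties in that state.
    candidates = [name for (st, name) in COUNTIES if st == state]
    best = max(candidates, key=lambda n: similarity(key[1], n), default=None)
    if best is not None and similarity(key[1], best) >= min_score:
        return COUNTIES[(state, best)], "needs_review"
    return None, "no_match"


def confirm(state, typed, canonical):
    """Record a human-approved correction so later lookups skip fuzzy matching."""
    confirmed[(state, typed.strip().lower())] = canonical
```

With this sketch, `lookup("MD", "baltimor")` returns the Baltimore County code flagged as `"needs_review"`; after `confirm("MD", "baltimor", "baltimore")` the same call is answered from the cache.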
Sounds right. Do you want to build it?
I can surely try it out. Do you have some documentation around the issue which can help me get started?
You could start with https://github.com/CJWorkbench/cjworkbench/wiki/Creating-A-Module
@jstray I've added a first version of this. You can check it out here: https://github.com/achyutjoshi/hello-workbench and https://github.com/achyutjoshi/cjworkbench
You can try the workflow using this dummy dataset -
import pandas as pd
dummy_df = pd.DataFrame({'state': ['maaryland state', 'Georgia', 'California', 'colarado', 'florida'], 'county': ['baltimore', 'brooks', 'achyut', 'jackson', 'jackson']})
Things to note -
- I am in the process of adding tests and documentation
- I know of an existing bug that occurs when tolerance > 80. I will fix that soon too.
Would love to know your feedback.
Hi! Thanks so much for this, it's a great start! I tested it in Workbench. Some notes:
- I notice it requires fuzzywuzzy. Makes sense. But the Workbench Docker container doesn't normally install that module, so we'd have to add it before we could deploy this on our servers.
- State "ca" and county "foo" resolves to Modoc County, California. This is more than edit distance 2 away, so I'm not sure why it matches. I'd expect the result to be null if there is no match close enough.
- Documentation would definitely be useful. You can set the help link in the YAML; it could just go to the GitHub README for now.
Finally, please join us on Gitter for faster response https://gitter.im/workbenchdata/Lobby
Thanks!
- Yes, it does require fuzzywuzzy. Once we are done with the improvements, can we add the dependency to the Docker container?
- State "ca" and county "foo": what tolerance level did you use? If I use 79, it works as expected.
- I will complete the documentation and add it to the GitHub readme.
Ah I guess I am misunderstanding tolerance -- is it 0-100? I thought it was edit distance.
So 100=perfect matches only? Perhaps it should default to something much higher than 2.0. Or maybe it should work in reverse, default to zero, and be called "Match percentage error" or something with "percentage" in the name so users understand the range.
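To make the two readings of "tolerance" concrete, here is a rough sketch that computes both an edit distance and a 0-100 similarity score for the same pair of strings, then applies the reversed "percentage error" threshold suggested above. Everything here is an assumption for illustration: `edit_distance`, `similarity_pct`, `best_match`, and `max_error_pct` are hypothetical names, and `difflib` is used as a stand-in scorer rather than the module's actual fuzzywuzzy call.

```python
from difflib import SequenceMatcher


def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity_pct(a, b):
    """Similarity on a 0-100 scale, where 100 means a perfect match."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, choices, max_error_pct=20):
    """Return the closest choice, or None if it misses by more than max_error_pct."""
    if not choices:
        return None
    best = max(choices, key=lambda c: similarity_pct(name, c))
    error_pct = 100 - similarity_pct(name, best)
    return best if error_pct <= max_error_pct else None
```

Under this sketch, "foo" versus "modoc" has an edit distance of 3, and `best_match("foo", ["modoc"])` returns `None` at the default threshold, while `best_match("colarado", ["colorado", "california"])` returns `"colorado"`. Setting `max_error_pct=0` accepts perfect matches only.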
Yes. 100 = perfect matches.
And yes, I will change the name so it is more intuitive and maybe default to something higher.