
County name cleanup module

Open jstray opened this issue 6 years ago • 9 comments

It would be very useful, for local data journalism, to have a module that cleans US county names and looks up their FIPS codes. Attached is a mockup of what this might look like.

jstray avatar May 10 '18 21:05 jstray

I have worked on something similar before. A human-in-the-loop system can be used to solve this problem. Think of it as a repository of all the different misspellings that can occur for a county name. The model (along with human intervention) matches these faulty spellings based on context (state name, area code, etc.). Once people start using it, the faulty spellings can be cached in the system to make the process faster.
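
As a rough illustration of the caching idea, here is a minimal sketch of a store of human-confirmed corrections keyed by state and misspelling. The class name `CountyCorrections` and its methods are hypothetical, chosen only for this example; this is not code from any existing module.

```python
from typing import Dict, Optional, Tuple


class CountyCorrections:
    """Cache of human-confirmed (state, misspelling) -> canonical county name."""

    def __init__(self) -> None:
        self._known: Dict[Tuple[str, str], str] = {}

    @staticmethod
    def _key(state: str, county: str) -> Tuple[str, str]:
        # Normalize case and surrounding whitespace so "Baltimor " and
        # "baltimor" share one cache entry.
        return (state.strip().lower(), county.strip().lower())

    def lookup(self, state: str, county: str) -> Optional[str]:
        """Return the confirmed canonical name, or None if not seen before."""
        return self._known.get(self._key(state, county))

    def confirm(self, state: str, county: str, canonical: str) -> None:
        """Record a correction that a human reviewer has approved."""
        self._known[self._key(state, county)] = canonical


corrections = CountyCorrections()
corrections.confirm("Maryland", "baltimor", "Baltimore County")
```

A lookup that misses the cache would fall through to fuzzy matching and, if the match is uncertain, to human review; a confirmed answer is then stored so the same misspelling resolves immediately next time.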

achyutjoshi avatar May 29 '19 06:05 achyutjoshi

Sounds right. Do you want to build it?

jstray avatar May 29 '19 15:05 jstray

I can surely try it out. Do you have some documentation around the issue which can help me get started?

achyutjoshi avatar May 31 '19 05:05 achyutjoshi

You could start with https://github.com/CJWorkbench/cjworkbench/wiki/Creating-A-Module

jstray avatar May 31 '19 17:05 jstray

@jstray I've added a first version of this. You can check it out here: https://github.com/achyutjoshi/hello-workbench and https://github.com/achyutjoshi/cjworkbench

You can try the workflow using this dummy dataset -

dummy_df = pd.DataFrame({'state' : ['maaryland state','Georgia','California','colarado','florida'],'county' : ['baltimore','brooks','achyut','jackson','jackson']})

Things to note -

  1. I am in the process of adding tests and documentation
  2. I know of an existing bug that occurs when tolerance > 80. I will fix that soon too.

Would love to hear your feedback.

achyutjoshi avatar Jun 13 '19 06:06 achyutjoshi

Hi! Thanks so much for this, it's a great start! I tested it in Workbench. Some notes:

  • I notice it requires fuzzywuzzy. Makes sense. But the Workbench Docker container doesn't normally install that module, so we'd have to add it before we could deploy this on our servers.
  • State "ca" and county "foo" resolves to Modoc County, California. This is more than edit distance 2 away, so I'm not sure why it matches. I'd expect the result to be null if there is no match close enough.
  • Documentation would definitely be useful. You can set the help link in the YAML; it could just go to the GitHub readme for now.

Finally, please join us on Gitter for faster response https://gitter.im/workbenchdata/Lobby

jstray avatar Jun 19 '19 18:06 jstray

Thanks!

  1. Yes it does require fuzzywuzzy. Once we are done with the improvements, we can add the dependency to the docker container?
  2. State "ca" and county "foo" - What is the tolerance level you used? If I use 79 - it does work as expected.
  3. I will complete the documentation and add it to the GitHub readme.

achyutjoshi avatar Jun 23 '19 18:06 achyutjoshi

Ah I guess I am misunderstanding tolerance -- is it 0-100? I thought it was edit distance.

So 100=perfect matches only? Perhaps it should default to something much higher than 2.0. Or maybe it should work in reverse, default to zero, and be called "Match percentage error" or something with "percentage" in the name so users understand the range.
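
To make the 0-100 reading of "tolerance" concrete, here is a minimal sketch of a similarity threshold that returns no match when nothing is close enough. It uses Python's standard-library `difflib` as a stand-in for fuzzywuzzy, so the scores may differ from the module's. The names `similarity`, `match_county`, and `min_score`, and the three-county table, are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

# Tiny illustrative table of California counties and their FIPS codes.
COUNTY_FIPS = {
    "Modoc County": "06049",
    "Los Angeles County": "06037",
    "Marin County": "06041",
}


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score; 100 means identical ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_county(name: str, candidates: Iterable[str],
                 min_score: int = 80) -> Optional[str]:
    """Return the best candidate scoring at least min_score, else None."""
    best, best_score = None, -1
    for candidate in candidates:
        score = similarity(name, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= min_score else None


print(match_county("modoc cnty", COUNTY_FIPS))  # close misspelling
print(match_county("foo", COUNTY_FIPS))         # nothing close enough
```

With a percentage-style threshold, a very low setting such as 2 accepts almost any best candidate, which would explain an input like "foo" resolving to some county. Naming the parameter after what it measures, and returning null below the threshold, avoids that surprise.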

jstray avatar Jun 23 '19 19:06 jstray

Yes. 100 = perfect matches.

And yes, I will change the name so it is more intuitive and maybe default to something higher.

achyutjoshi avatar Jun 23 '19 19:06 achyutjoshi