house_expenditures

Clean the `office` variable

Open • supermdat opened this issue 7 years ago • 2 comments

The values in the data appear to have been entered manually and are therefore not standardized. As a result, the same entity in the `office` variable is spelled in several different ways, and those variants should be collapsed into one value. For example, "NET EXPENSES OF EQUIP" and "NET EXPENSES OF EQUIPMENT" are presumably the same. Similarly, "HOUSE CHILD CARE GENERAL FUND", "HOUSE CHILD CARE CENTER", and "CHILD CARE CTR" may not be exactly the same, but could/should be aggregated.

Issue #29 took steps to clean this variable using the Jaro-Winkler distance from `stringdist::stringdist`, but some duplicates remain, and additional cleaning would be useful.
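
To make the idea concrete, here is a minimal sketch of Jaro-Winkler-based collapsing. It is written in plain Python rather than R and is not the code from Issue #29. The function names (`jaro`, `jaro_winkler`, `collapse_labels`), the greedy first-seen grouping, and the 0.92 threshold are all assumptions chosen for illustration.

```python
def jaro(s, t):
    """Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(ls, lt) // 2 - 1, 0)
    s_hit = [False] * ls
    t_hit = [False] * lt
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / ls + matches / lt
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def collapse_labels(labels, threshold=0.92):
    """Map each label to the first-seen label it resembles (hypothetical helper).

    Greedy and order-dependent: a label joins the first representative whose
    similarity reaches the (assumed) threshold, otherwise it starts a group.
    """
    representatives = []
    mapping = {}
    for label in labels:
        for rep in representatives:
            if jaro_winkler(label, rep) >= threshold:
                mapping[label] = rep
                break
        else:
            representatives.append(label)
            mapping[label] = label
    return mapping


offices = [
    "NET EXPENSES OF EQUIP",
    "NET EXPENSES OF EQUIPMENT",
    "HOUSE CHILD CARE CENTER",
]
mapping = collapse_labels(offices)
```

With these inputs the two "NET EXPENSES" spellings end up under one representative, while the child care label stays in its own group. Because Jaro-Winkler rewards a shared prefix, variants that begin differently (such as "CHILD CARE CTR" versus "HOUSE CHILD CARE CENTER") may fall below the threshold, which is one plausible reason some duplicates survive this kind of cleaning.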

Any method (topic modeling, word2vec, etc.) would be acceptable so long as it is accurate and scalable.

supermdat, Sep 25 '17 21:09

I'm interested in giving this a shot. I'm more familiar with the modeling aspect of this issue and less familiar with GitHub, so some direction would be very much appreciated. I see the data on data.world. I'm assuming the data set that I can work with is called "house_expenditures_clean"? Also, should I clone the repo to my laptop or make a pull request? Thanks!

nkk36, Oct 24 '17 02:10

Hi @nkk36! Glad to have you here!

Yes, please go ahead and use "house_expenditures_clean" from data.world as the base data - it should include all of the data except for 2017 Q2. You can also take a look at Issue #29 to see what I had tried before.

I don't think any of us (myself included) are very strong at GitHub - I know just enough to do what I need. Generally speaking, you want to:

  1. Fork the repo
  2. Clone your fork to your local machine
  3. Make commits to your local clone
  4. When you're ready, make a pull request

I generally found the D4D GitHub info to be helpful: https://github.com/Data4Democracy/github-playground

As well as this site: https://gist.github.com/hofmannsven/6814451

supermdat, Oct 25 '17 14:10