
Clean the `purpose` variable

Open supermdat opened this issue 7 years ago • 2 comments

The values in the data appear to have been entered manually and are therefore not standardized. As a result, the same entity often appears under several spellings in the `purpose` variable, and these variants should be collapsed into a single value. For example, "CONG AIDE/OUTREACH SERVICES" and "CONGRESS AIDE/OUTREACH SER" are presumably the same. Similarly, "EXECUTIVE ASSISTANT/LEGISLATIV" and "EXECUTIVE ASSISTANT/LEGISLATIV (OTHER COMPENSATION)" may not be exactly the same, but could/should be aggregated.
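To make the two examples above concrete, here is a minimal sketch of one possible heuristic: treat two values as the same purpose when their tokens line up and each pair of tokens is related by truncation (one is a prefix of the other). The function names `tokens` and `same_purpose` are hypothetical, and dropping a trailing parenthetical is an assumption about how aggregation might work, not something already done in the repo.

```python
import re

def tokens(value):
    """Upper-case the value, drop a trailing parenthetical, split on non-alphanumerics."""
    # Assumption: a trailing "(...)" is a qualifier that can be ignored when aggregating.
    value = re.sub(r"\s*\([^)]*\)\s*$", "", value.upper())
    return [t for t in re.split(r"[^A-Z0-9]+", value) if t]

def same_purpose(a, b):
    """Heuristic: same number of tokens, and each token pair differs only by truncation."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return False
    return all(x.startswith(y) or y.startswith(x) for x, y in zip(ta, tb))

# The two pairs quoted in this issue.
print(same_purpose("CONG AIDE/OUTREACH SERVICES", "CONGRESS AIDE/OUTREACH SER"))
print(same_purpose("EXECUTIVE ASSISTANT/LEGISLATIV",
                   "EXECUTIVE ASSISTANT/LEGISLATIV (OTHER COMPENSATION)"))
```

A rule like this only catches truncation-style variants; misspellings and reordered words would still need a fuzzy string distance or a model-based approach.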

Issue #29 took steps to clean this variable using the Jaro-Winkler distance from `stringdist::stringdist`, but some duplicates remain, and additional cleaning would be useful.
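For anyone picking this up outside R, the following is a rough pure-Python sketch of the Jaro-Winkler distance, written from the standard definition. It is not the `stringdist` implementation and is not guaranteed to match its output exactly; the function names and the prefix scaling factor `p = 0.1` (the conventional value) are assumptions and may differ from the settings used in Issue #29.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this window of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler_distance(s1, s2, p=0.1):
    """Distance in [0, 1]; 0.0 means identical. Rewards a shared prefix of up to 4 chars."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return 1.0 - (sim + prefix * p * (1.0 - sim))

# One of the pairs quoted in this issue; smaller values mean more similar strings.
print(jaro_winkler_distance("CONG AIDE/OUTREACH SERVICES", "CONGRESS AIDE/OUTREACH SER"))
```

Values whose distance falls below some cutoff could then be grouped together; the cutoff would need to be tuned against the data, which is likely where the remaining duplicates slip through.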

Any method (topic modeling, word2vec, etc.) would be acceptable so long as it is accurate and scalable. Because of the large number of unique entries for this variable, condensing the entries into similar categories (e.g., with topic modeling) may be particularly beneficial.

supermdat, Sep 25 '17 22:09

I can start working on this, and will probably put in a pull request later this week. Do you guys have a preference for R, or is Python ok?

frankiezeager, Oct 03 '17 01:10

Cool! Thanks, @frankiezeager! Feel free to let me know if you have any questions about what I tried before to help standardize/clean this variable.

Most of the work on the repo has been in R, but I don't think it should matter much. As long as we're transparent and have a method for connecting the original data to the cleaned-up data, then I think we're good.
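One way to keep that connection, sketched here in Python, is to store a crosswalk table with one row per distinct original value next to its cleaned value, so the mapping can be joined back onto the raw data from either language. The column names `purpose_original` and `purpose_clean`, the helper names, and the toy cleaning rule are all hypothetical; any cleaning function could be swapped in.

```python
import csv
import io
import re

def build_crosswalk(original_values, clean):
    """Return sorted (original, cleaned) pairs, one per distinct original value."""
    return sorted({(value, clean(value)) for value in original_values})

def write_crosswalk(rows, handle):
    """Write the crosswalk as CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["purpose_original", "purpose_clean"])
    writer.writerows(rows)

def strip_parenthetical(value):
    """Toy cleaning rule (an assumption): drop a trailing parenthetical qualifier."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", value).strip()

raw = [
    "EXECUTIVE ASSISTANT/LEGISLATIV",
    "EXECUTIVE ASSISTANT/LEGISLATIV (OTHER COMPENSATION)",
]
buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_crosswalk(build_crosswalk(raw, strip_parenthetical), buffer)
print(buffer.getvalue())
```

Because the original value is kept verbatim in the first column, the cleaning step stays transparent and reversible regardless of which method produced the second column.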

supermdat, Oct 03 '17 21:10