gendercoder
add fuzzy matching
There's a persistent issue where people provide expansive and idiosyncratic responses (e.g. "I'm sexually female") that can be reasonably classified by a human user, but are difficult to accommodate in the dictionaries method as it stands.
There are a number of suggestions for how we might resolve this (e.g. grep), but these of course have potential issues with unknown future inputs. Emily also likes how the current process gives you a transparent log of how recoding happens, which becomes trickier with fuzzy matching.
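To make the concern about pattern matching concrete, here is a minimal sketch in base R. The responses are made up for illustration and none of this is gendercoder code; it only shows how a naive grep-style match can misfire and why even a tightened pattern cannot anticipate unseen inputs.

```r
# Hypothetical free-text responses, for illustration only.
responses <- c("male", "female", "I'm sexually female", "Female (cis)")

# A naive pattern match: "male" is a substring of "female",
# so every response here is flagged as a match for "male".
naive_male <- grepl("male", responses, ignore.case = TRUE)

# Word boundaries avoid that particular collision, but the pattern
# still only covers the kinds of responses we thought of in advance.
bounded_male <- grepl("\\bmale\\b", responses, ignore.case = TRUE, perl = TRUE)
bounded_female <- grepl("\\bfemale\\b", responses, ignore.case = TRUE, perl = TRUE)

data.frame(responses, naive_male, bounded_male, bounded_female)
```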
This is a summary of the implementation of fuzzy matching proposed by Emily and me.
Fuzzy matching should:
- not be default
- require deliberate action to implement (i.e. not just fuzzy = TRUE)
- require user input to validate matches
The core function arguments would default to:
gender_recode <- function(gender = gender, dictionary = gendercoder::broad, fill = FALSE, match = "exact")
And implementation would be:
gender_recode(gender_data, dictionary = broad, fill = TRUE, match = "fuzzy")
> gendercoder has exactly matched 99 (99%) of cases
> gendercoder suggests that "I'm sexually female" indicates a gender of: female. Please provide input:
1. Yes, female
2. No, male
3. No, Sex and gender diverse
4. No, other (provide text input)
5. No, replace with NA
Selection:
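One way the suggest-then-validate step might look, as a rough sketch only. The names toy_dictionary, suggest_gender and validate_suggestion are hypothetical, the dictionary is assumed to be a named character vector, and the menu is cut down to two options. Matching is done word by word with base R's adist() so that "female" is not mistaken for "male"; the choose argument stands in for the interactive prompt so the example can run without a user.

```r
# Hypothetical stand-in dictionary: names are raw responses,
# values are the recoded categories.
toy_dictionary <- c(male = "male", female = "female",
                    woman = "female", man = "male")

# Suggest a category by comparing each word of the response with each
# dictionary key and taking the key with the smallest edit distance.
suggest_gender <- function(response, dictionary) {
  words <- strsplit(tolower(response), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)
  distances <- utils::adist(words, names(dictionary))  # words x keys
  best_key <- which.min(apply(distances, 2, min))
  unname(dictionary[best_key])
}

# Ask the user to confirm the suggestion. `choose` defaults to the
# interactive utils::menu(), but can be swapped out for testing.
validate_suggestion <- function(response, suggestion,
                                choose = function(choices, title) {
                                  utils::menu(choices, title = title)
                                }) {
  choices <- c(paste("Yes,", suggestion), "No, replace with NA")
  title <- sprintf('gendercoder suggests that "%s" indicates a gender of: %s.',
                   response, suggestion)
  picked <- choose(choices, title)
  if (picked == 1) suggestion else NA_character_
}

suggestion <- suggest_gender("I'm sexually female", toy_dictionary)

# Non-interactive stand-in for the user selecting option 1.
validated <- validate_suggestion("I'm sexually female", suggestion,
                                 choose = function(choices, title) 1L)
```

Any other selection, including cancelling the menu, falls through to NA in this sketch, so nothing is recoded without an explicit "yes".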
Keen to get input on alternatives and implementations.
One issue with this is that it would create a pipeline that is not reproducible and can't be run inside an R Markdown document (without author input).
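As an illustration of the knitting problem: base R's interactive() is FALSE when a document is rendered in a non-interactive session, and utils::menu() cannot be used there. A hypothetical guard (check_can_prompt is not a gendercoder function) could at least fail with a clear message instead of hanging or erroring obscurely:

```r
# Hypothetical guard: stop early if validation prompts would be needed
# but nobody is available to answer them (e.g. while knitting).
check_can_prompt <- function(n_unmatched, is_interactive = interactive()) {
  if (n_unmatched > 0 && !is_interactive) {
    stop(n_unmatched, " response(s) need manual validation, ",
         "but this session is not interactive.", call. = FALSE)
  }
  invisible(TRUE)
}

# Nothing left to validate, so a non-interactive run is fine.
check_can_prompt(0, is_interactive = FALSE)
```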
Given that we already allow use of a custom dictionary, could we instead have a function like gender_create_dictionary() that has a similar implementation, except that it uses the user responses to build a custom dictionary? That way people could just apply the custom dictionary when running the code again.
I imagine a pipeline like:
- User applies gender_recode() with a broad dictionary to data and 99% are matched
- User applies gender_create_dictionary() to unmatched data to create a new dictionary that recodes previously unmatched responses. Optionally the code to create this dictionary is provided as a message so it can be easily reused.
- User applies gender_recode() to data with the custom dictionary and remaining 1% are matched
- If the gender recoding needs to be re-run this could be achieved by using gender_recode(dictionary = c(broad, custom))
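The pipeline above can be sketched roughly as follows. This assumes dictionaries are named character vectors that can be combined with c(); recode_exact and the toy broad dictionary are hypothetical stand-ins, not the package's actual functions or data, and the custom dictionary is typed by hand where gender_create_dictionary() would collect it interactively.

```r
# Hypothetical stand-in for the broad dictionary.
broad <- c(male = "male", female = "female", woman = "female", man = "male")

# Exact lookup against a named-vector dictionary;
# responses with no matching key come back as NA.
recode_exact <- function(gender, dictionary) {
  unname(dictionary[tolower(trimws(gender))])
}

gender_data <- c("Male", "female", "I'm sexually female", "apogender")

# Step 1: exact matching leaves the idiosyncratic responses as NA.
first_pass <- recode_exact(gender_data, broad)
unmatched <- gender_data[is.na(first_pass)]

# Step 2: the validated answers become a custom dictionary.
custom <- c("i'm sexually female" = "female", apogender = "apogender")

# dput() prints R code that recreates the dictionary, which could be
# pasted into a script or R Markdown chunk for later runs.
dput(custom)

# Steps 3 and 4: the combined dictionary recodes everything, no prompts.
second_pass <- recode_exact(gender_data, c(broad, custom))
```

Once the custom dictionary has been saved as code, only the last line needs to be re-run, which is what makes the later passes reproducible.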
Also, the selection options should have 6. No, replace with inputted value. This would be useful for novel responses like apogender that should be added to the dictionary without requiring the user to retype.
Except that also doesn't run in an RMD.
The other hang-up is that this is going to have scalability problems. Taking inputs for 12 fuzzy matches is fine. Taking it for 120 is going to be a PITA.
Why wouldn't it run in an RMD? In that workflow you would use the message text from Step 2 to recreate the custom dictionary programmatically.
Not in one pass, I mean. Yes, once the dictionary is created, it's created, but there's still an interactive step built into that pipeline.
Yes, I can't see much of a way around that without simply skipping validation of fuzzy matches, which seems dangerous.