
add fuzzy matching

Open Lingtax opened this issue 3 years ago • 5 comments

There's a persistent issue where people provide expansive and idiosyncratic responses (e.g. "I'm sexually female") that a human user can reasonably classify, but that are difficult to accommodate in the dictionary method as it stands.

There are a number of suggestions for how we might resolve this (e.g. grep), but these of course have potential issues with unknown future inputs. Emily also likes how the current process gives you a transparent log of how recoding happens, which becomes trickier with fuzzy matching.

This is a summary of the implementation of fuzzy matching proposed by Emily and me.

Fuzzy matching should:

  1. not be default
  2. require deliberate action to implement (i.e. not just fuzzy = TRUE),
  3. require user input to validate matches.

The core function arguments would default to:

gender_recode <- function(gender = gender, dictionary = gendercoder::broad, fill = FALSE, match = "exact")

And implementation would be:

gender_recode(gender_data, dictionary = broad, fill = TRUE, match = "fuzzy")

> gendercoder has exactly matched 99 (99%) of cases
> gendercoder suggests that "I'm sexually female" indicates a gender of: female. Please provide input:

1. Yes, female
2. No, male
3. No, Sex and gender diverse
4. No, other (provide text input)
5. No, replace with NA

 Selection:
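
To make the proposal concrete, here is a minimal sketch of how a suggestion could be generated and then confirmed, using only base R. Everything named here is hypothetical and not part of gendercoder: `toy_dict` stands in for `gendercoder::broad`, and `suggest_gender()` and `confirm_suggestion()` are illustrative helpers. It compares whole words by edit distance rather than using substring matching, since "male" is a substring of "female".

```r
# Hypothetical stand-in for gendercoder::broad: a named character vector where
# names are raw responses and values are the recoded categories.
toy_dict <- c(female = "female", woman = "female", f = "female",
              male = "male", man = "male", m = "male")

# Hypothetical helper: suggest a category for one free-text response by
# comparing each word in it against every dictionary key.
suggest_gender <- function(response, dictionary, max_dist = 1) {
  # Split the lower-cased response into word tokens.
  tokens <- strsplit(tolower(response), "[^a-z']+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(NA_character_)
  # Edit distance between every token (rows) and every key (columns).
  d <- utils::adist(tokens, names(dictionary))
  if (min(d) > max_dist) return(NA_character_)
  # Column of the closest key; ties resolve to the first one found.
  best_key <- which(d == min(d), arr.ind = TRUE)[1, "col"]
  unname(dictionary[best_key])
}

# Hypothetical helper: ask the user to validate a suggestion. Outside an
# interactive session it returns NA rather than accepting the guess silently.
confirm_suggestion <- function(response, suggestion) {
  if (!interactive()) return(NA_character_)
  choice <- utils::menu(
    c(paste("Yes,", suggestion), "No, replace with NA"),
    title = sprintf('"%s" suggests: %s. Accept?', response, suggestion)
  )
  if (choice == 1) suggestion else NA_character_
}

suggest_gender("I'm sexually female", toy_dict)
```

Note that with single-letter keys such as "f" and "m", a short word like "a" is within one edit of a key, so a sketch like this will produce some poor suggestions. That is part of the argument for requiring validation.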

Keen to get input on alternatives and implementations.

Lingtax, Mar 18 '21 00:03

One issue with this would be that it would create a pipeline that is not reproducible and can't be run inside an R Markdown document (without author input).

Given that we already allow use of a custom dictionary, could we instead have a function like gender_create_dictionary() that has a similar implementation, except that it uses the user responses to build a custom dictionary? That way people could just apply the custom dictionary when running the code again.

I imagine a pipeline like

  1. User applies gender_recode() with a broad dictionary to data and 99% are matched.
  2. User applies gender_create_dictionary() to unmatched data to create a new dictionary that recodes previously unmatched responses. Optionally, the code to create this dictionary is provided as a message so it can be easily reused.
  3. User applies gender_recode() to data with the custom dictionary and the remaining 1% are matched.
  4. If the gender recoding needs to be re-run, this could be achieved by using gender_recode(dictionary = c(broad, custom)).
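
A rough sketch of steps 2 to 4, assuming dictionaries are named character vectors (names are raw responses, values are recoded categories). The names `toy_broad`, `build_custom_dictionary()` and `recode_with()` are made up for illustration and are not gendercoder functions. The interactive validation itself is left out; this only shows how validated decisions could become a reusable dictionary.

```r
# Hypothetical stand-in for the built-in broad dictionary.
toy_broad <- c(female = "female", male = "male")

# Hypothetical core of gender_create_dictionary(): takes responses whose
# categories have already been validated and returns a custom dictionary.
build_custom_dictionary <- function(responses, categories) {
  stopifnot(length(responses) == length(categories))
  keys <- tolower(trimws(responses))
  custom <- stats::setNames(categories, keys)
  custom <- custom[!duplicated(names(custom))]
  # Emit code that recreates the dictionary, so it can be pasted into a script
  # or Rmd and the pipeline can be rerun without prompts.
  message("custom <- ", paste(deparse(custom), collapse = ""))
  custom
}

custom <- build_custom_dictionary(
  responses  = c("I'm sexually female", "apogender"),
  categories = c("female", "apogender")
)

# Toy exact-match recoder, standing in for gender_recode(): unmatched
# responses come back as NA.
recode_with <- function(x, dictionary) {
  unname(dictionary[tolower(trimws(x))])
}

# Step 4: rerun with the built-in and custom dictionaries combined.
recode_with(c("Male", "I'm sexually female", "apogender"), c(toy_broad, custom))
```

Because the custom dictionary is just another named vector, combining it with `c()` keeps the exact-match behaviour and the transparent log of how each response was recoded.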

Also, the selection options should include a sixth: "6. No, replace with inputted value". This would be useful for novel responses like apogender that should be added to the dictionary without requiring the user to retype them.

ekothe, Mar 18 '21 21:03

Except that also doesn't run in an Rmd.

The other hang-up is that this is going to have scalability problems. Taking inputs for 12 fuzzy matches is fine. Taking it for 120 is going to be a PITA.
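
One way the number of prompts might be reduced (a sketch only, with made-up data) is to normalise and deduplicate the unmatched responses first, so each distinct string is reviewed once, most frequent first. This does not remove the problem when there are genuinely 120 distinct responses.

```r
# Made-up unmatched responses; several differ only in case or whitespace.
unmatched <- c("femal", "Femal ", "I'm sexually female", "femal", "apogender")

# Normalise, then count each distinct response.
normalised <- tolower(trimws(unmatched))
review_queue <- sort(table(normalised), decreasing = TRUE)

# Distinct strings to review, most frequent first.
names(review_queue)
length(review_queue)
```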

Lingtax, Mar 18 '21 23:03

Why wouldn't it run in an Rmd? In that workflow you would use the message text from Step 2 to recreate the custom dictionary programmatically.

ekothe, Mar 19 '21 00:03

Not in one pass, I mean. Yes, once the dictionary is created, it's created, but there's still an interactive step built into that pipeline.

Lingtax, Mar 19 '21 00:03

Yes, I can't see much of a way around that without simply skipping validation of fuzzy matches, which seems dangerous.

ekothe, Mar 19 '21 01:03