Regex: common abbreviations

Open Olucik opened this issue 6 years ago • 12 comments

  • countrycode("Centr. African Rep.", "country.name", "country.name") should result in "Central African Republic"
  • countrycode("Dominic. Republic.", "country.name", "country.name") should result in "Dominican Republic"
  • countrycode("Kuweit", "country.name", "country.name") should result in "Kuwait"
  • countrycode("Timor", "country.name", "country.name") should result in "Timor Leste"
  • countrycode("Korea, Democratic Republic of", "country.name", "country.name") should result in "Democratic People's Republic of Korea"

Olucik avatar Sep 18 '17 14:09 Olucik

Thanks for the report, but I'm not sure about some of these.

  • "Korea, Democratic Republic of" is picked up properly by the latest version of the package (install from GitHub using devtools)
  • Kuweit is a misspelling (or different language). I don't think countrycode should be expected to fix all possible misspellings. That's a very deep rabbit hole.
  • Same thing with abbreviations. I'm not completely opposed to supporting some very widespread abbreviations (e.g., USA), but "Dominic" instead of "Dominican" seems like a dataset-specific issue more than a common case that countrycode ought to support. There's a downside to including too many wildcards in regexes, since it increases the chance that we'll pick up false positives.
  • Timor is not the same as Timor-Leste, and we probably want to pick up only those names that match the corresponding ISO codes in the countrycode conversion dictionary.

I'm not completely closed on any of these ideas, but an actual argument should be made before effort is expended.

vincentarelbundock avatar Sep 18 '17 14:09 vincentarelbundock
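
The false-positive concern above can be illustrated with base R alone. This is only a sketch: both patterns below are hypothetical and are not taken from the countrycode dictionary, and the matching options (case-insensitive, Perl-style) are an assumption about how such patterns would be applied.

```r
# Base R only; countrycode itself is not needed for this illustration.
names_in <- c("Dominica", "Dominican Republic", "Dominic. Republic.")

# Hypothetical loose pattern: anything that starts with "dominic".
loose <- "^dominic"
# Hypothetical tighter pattern: additionally requires a word starting with "rep".
tight <- "^dominic.*\\brep"

loose_hits <- grepl(loose, names_in, ignore.case = TRUE, perl = TRUE)
tight_hits <- grepl(tight, names_in, ignore.case = TRUE, perl = TRUE)

# The loose pattern cannot tell Dominica from the Dominican Republic;
# the tighter one accepts the abbreviation but rejects Dominica.
print(data.frame(name = names_in, loose = loose_hits, tight = tight_hits))
```

The loose pattern flags all three names, which is exactly the kind of ambiguity that extra wildcards can introduce.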

Also, you might want to look into the custom_match argument.

vincentarelbundock avatar Sep 18 '17 14:09 vincentarelbundock
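
For dataset-specific spellings, a minimal sketch of the custom_match approach might look like the following. It assumes the countrycode package is installed; the input vector `raw` is made up for illustration, and the override values simply repeat the results requested in the original report.

```r
library(countrycode)

# Names as they might appear in a (hypothetical) source dataset.
raw <- c("Centr. African Rep.", "Dominic. Republic.", "Kuweit")

# Named vector: names are the raw inputs, values are the desired outputs.
overrides <- c(
  "Centr. African Rep." = "Central African Republic",
  "Dominic. Republic."  = "Dominican Republic",
  "Kuweit"              = "Kuwait"
)

# Values listed in custom_match take precedence over the dictionary lookup.
cleaned <- countrycode(raw, origin = "country.name",
                       destination = "country.name",
                       custom_match = overrides)
print(cleaned)
```

This keeps the dataset-specific fixes in user code rather than in the package's regexes.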

Thank you. "Korea, Democratic People's Republic of" works, but "Korea, Democratic Republic of" does not work with the current dev version.

Olucik avatar Sep 18 '17 19:09 Olucik

I see. Is that a common form? And how would you modify the regex?

dprk|d.p.r.k|korea.+(d.p.r|dpr|north|dem.*peo.*rep.*)|(d.p.r|dpr|north|dem.*peo.*rep.*).+korea

vincentarelbundock avatar Sep 18 '17 20:09 vincentarelbundock
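
To make the question concrete, here is a sketch that tests the regex quoted above against a few names using base R. The `candidate` pattern is a hypothetical variant (the "peo.*" part made optional), not a change that was agreed on in this thread, and the case-insensitive, Perl-style matching is an assumption about how the dictionary patterns are applied.

```r
# Regex quoted in the comment above.
current <- paste0(
  "dprk|d.p.r.k|korea.+(d.p.r|dpr|north|dem.*peo.*rep.*)",
  "|(d.p.r|dpr|north|dem.*peo.*rep.*).+korea"
)

# Hypothetical variant: "peo.*" is optional, so "Democratic Republic" also counts.
candidate <- paste0(
  "dprk|d.p.r.k|korea.+(d.p.r|dpr|north|dem.*(peo.*)?rep.*)",
  "|(d.p.r|dpr|north|dem.*(peo.*)?rep.*).+korea"
)

inputs <- c(
  "Korea, Democratic People's Republic of",
  "Korea, Democratic Republic of",
  "Korea, Republic of",
  "Congo, Democratic Republic of"
)

current_hits   <- grepl(current,   inputs, ignore.case = TRUE, perl = TRUE)
candidate_hits <- grepl(candidate, inputs, ignore.case = TRUE, perl = TRUE)

print(data.frame(input = inputs, current = current_hits, candidate = candidate_hits))
```

On these four inputs the variant additionally picks up "Korea, Democratic Republic of" while still rejecting "Korea, Republic of" and "Congo, Democratic Republic of". A real change would need to be checked against the full dictionary.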

It's how the Social Progress Index names it (for all the years). I haven't seen it anywhere else yet. Maybe it still falls under the "custom" category.

Olucik avatar Sep 18 '17 20:09 Olucik

Good to know. I just tried googling the expression in quotes, and can't really find other instances. I think I'll leave this issue open for future consideration, but not do anything for now.

Sorry if that seems unresponsive, I'm just not 100% convinced by the use-case, and I'm super busy at work these days (and trying to minimize non-essential tasks).

vincentarelbundock avatar Sep 18 '17 20:09 vincentarelbundock

thank you!

Olucik avatar Sep 18 '17 20:09 Olucik

Just hit this one, "Central African Rep.", in CEPII IPD 2012...

> countrycode("Central African Rep.", "country.name", "country.name")
[1] NA
Warning message:
In countrycode("Central African Rep.", "country.name", "country.name") :
  Some values were not matched unambiguously: Central African Rep.

cjyetman avatar Nov 10 '17 15:11 cjyetman

What's your view on abbreviations? Should those be countrycode's responsibility? I suppose we could make it countrycode's responsibility, but slippery slope, etc.

vincentarelbundock avatar Nov 10 '17 15:11 vincentarelbundock

I think we're already on that slope since we currently do: "Korea, Rep. of", "Rep. of Korea", "U.S.A.", "D.P.R. Korea", "D.R. Congo", "U.S. Virgin Islands", etc. Since there's a finite number of country names, and an even smaller finite list of words/names within that which can be reasonably abbreviated, that slope may be slippery, but not so long.

And since the regex codes are really the core value of the package, then maybe they should be as adaptable as possible? Of course, what does/should trump all of that is who/when/how does it get done, does it negatively affect anything else, etc., which are completely valid concerns also.

cjyetman avatar Nov 10 '17 17:11 cjyetman

I'm convinced.

vincentarelbundock avatar Nov 10 '17 17:11 vincentarelbundock
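
As an illustration of what an abbreviation-tolerant pattern could look like, here is a base R sketch. The pattern is hypothetical and is not the regex used by countrycode; it only shows that anchoring on word prefixes can accept the abbreviated forms reported in this thread without matching unrelated names.

```r
# Hypothetical pattern: three word prefixes in order, each at a word boundary.
car_pattern <- "\\bcentr.*\\bafr.*\\brep"

inputs <- c(
  "Central African Republic",
  "Central African Rep.",
  "Centr. African Rep.",
  "South Africa",
  "Central America"
)

car_hits <- grepl(car_pattern, inputs, ignore.case = TRUE, perl = TRUE)
print(data.frame(input = inputs, matched = car_hits))
```

The first three inputs match and the last two do not, since neither contains all three prefixes in order.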

changed the title because this thread eventually led to an agreement that some abbreviations should be considered for addition to the regexes

cjyetman avatar Apr 25 '18 08:04 cjyetman

Thanks again for opening this issue. If someone has very specific suggestions for changes to the regular expressions, I encourage them to create a Pull Request by modifying the dictionary/data_regex.csv file.

For now, we'll close this issue to tidy up the repo.

vincentarelbundock avatar Aug 24 '22 16:08 vincentarelbundock
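
Before proposing a regex change, it may help to check that each test name matches exactly one pattern. The sketch below uses a hypothetical three-entry dictionary and a hypothetical helper, `count_matches`; the real patterns live in dictionary/data_regex.csv and are not reproduced here.

```r
# Hypothetical mini-dictionary: country name -> regex (not the package's own).
patterns <- c(
  "Dominica"                 = "dominica(?!n)",
  "Dominican Republic"       = "dominic.*rep",
  "Central African Republic" = "\\bcentr.*\\bafr.*\\brep"
)

# Hypothetical helper: for each input, count how many patterns match it.
count_matches <- function(x, patterns) {
  vapply(x, function(name) {
    sum(vapply(patterns, grepl, logical(1), x = name,
               ignore.case = TRUE, perl = TRUE))
  }, integer(1))
}

inputs <- c("Dominica", "Dominic. Republic.", "Centr. African Rep.", "Timor")
counts <- count_matches(inputs, patterns)

# A count of 0 means unmatched; a count above 1 means ambiguous.
print(counts)
print(names(counts)[counts != 1])
```

Here only "Timor" is flagged, because no pattern in the toy dictionary matches it.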