countrycode icon indicating copy to clipboard operation
countrycode copied to clipboard

Enable Chinese as origin language

Open turbanisch opened this issue 1 year ago • 7 comments

So far, there seems to be no way of converting country names from Chinese, not even by left-joining any of the dataframes that come with countrycode. I assume the reason for this is the lack of a corresponding set of regexes? If there is any interest, I would like to offer to come up with one!

turbanisch avatar Aug 25 '22 19:08 turbanisch

If you can come up with regexes, that would lovely!

But have you tried the countryname function? If so, what problem did you run into?

vincentarelbundock avatar Aug 25 '22 19:08 vincentarelbundock

Since {countrycode} does include CLDR names in Chinese, one could (rather easily) create a custom dictionary using the included codelist to achieve exact matching, though it will not do fancy regex matching, e.g...

library(countrycode)

custom_dict <-
  unique(countrycode::codelist[, c("cldr.short.zh", "country.name.en", "iso3c")])

countrycode(
  "阿拉伯联合酋长国",
  origin = "cldr.short.zh",
  destination = "country.name.en",
  custom_dict = custom_dict
)
#> [1] "United Arab Emirates"

countrycode(
  "阿拉伯联合酋长国",
  origin = "cldr.short.zh",
  destination = "iso3c",
  custom_dict = custom_dict
)
#> [1] "ARE"

But yes, fancy Chinese regexes would be very interesting!

cjyetman avatar Aug 25 '22 19:08 cjyetman

Ah, good idea, @cjyetman ! I think eventually it would be nice to have a more lenient regex matching because even data from Chinese authorities is messy, for some countries it may list the full country name, for others just an abbreviated one.

I had quickly discarded countryname but apparently I just picked the wrong examples: neither Germany nor France have Chinese versions and the one for China matches only the full country name (aka "People's Republic of China"):

library(tidyverse)
library(countrycode)

# no matches
countryname("德国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 德国
#> [1] NA
countryname("中国")
#> Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: 中国
#> [1] NA

# there are Chinese country names in the data but not for all countries
# some match the official country name and may be too strict
countrycode::countryname_dict %>% 
  as_tibble() %>% 
  filter(str_detect(country.name.alt, "\\p{script=Han}")) %>% 
  filter(country.name.en %in% c("Germany", "France", "China"))
#> # A tibble: 2 × 2
#>   country.name.en country.name.alt
#>   <chr>           <chr>           
#> 1 China           中华人民共和国  
#> 2 China           中華人民共和國

Created on 2022-08-26 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Monterey 12.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2022-08-26
#>  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom         1.0.0   2022-07-01 [1] CRAN (R 4.2.0)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  countrycode * 1.4.0   2022-05-04 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  dbplyr        2.2.0   2022-06-05 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.0.9   2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr          1.4.3   2022-05-04 [1] CRAN (R 4.2.0)
#>  jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.8.0   2022-07-18 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.2   2022-01-30 [1] CRAN (R 4.2.0)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.4   2022-07-12 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.2.0)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  tibble      * 3.1.8   2022-07-22 [1] CRAN (R 4.2.0)
#>  tidyr       * 1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

turbanisch avatar Aug 25 '22 22:08 turbanisch

I took a bit of a deep dive and came with regular expressions for Chinese by scraping Wikipedia. I outlined the issue (in short: Chinese has many variants, not just simplified vs. traditional scripts but also depending on regions) here and implemented a function as a proof of concept here.

Do let me know if you would like to incorporate the regular expressions into countrycode and I will try to come up with a PR! I assume this is how I would have to prepare the codes? https://github.com/vincentarelbundock/countrycode#adding-a-new-code

There is one issue though that you might want to handle differently than I did: my function converts the input into simplified characters before matching via regular expressions. I discuss alternative implementations in the README mentioned above (https://github.com/turbanisch/chinese-countryname-regex). In short, for maintenance reasons I would suggest only adding the regular expresssions for simplified Chinese and perhaps add a note in the countrycode documentation telling the user to apply a traditional-to-simplified conversion herself as needed.

turbanisch avatar Sep 25 '22 22:09 turbanisch

@turbanisch this is really cool, thanks for all your effort so far

just a suggestion if/when you start building a PR... I would suggest following the model of adding known variants of each country to a data file for the tests, and then testing each one of them in a {testthat} test, as done here... https://github.com/vincentarelbundock/countrycode/blob/main/tests/testthat/data-known-name-variations.R https://github.com/vincentarelbundock/countrycode/blob/main/tests/testthat/test-known-name-variations.R

cjyetman avatar Sep 26 '22 15:09 cjyetman

That all sounds great! Thanks for putting this together.

I wont be able to look at anything in the near term, but feel free to open a PR whenever you are ready, and Ill try to review when possible.

Also agree with the know variations tests. Those would be super useful.

vincentarelbundock avatar Sep 27 '22 21:09 vincentarelbundock

Thanks a lot to both of you! I will see how far I get and prepare a PR in the coming days.

turbanisch avatar Sep 27 '22 22:09 turbanisch