countrycode
COW codes Germany/Vietnam
countrycode version 1.0.0 doesn't match the COW codes for Germany (260) and Vietnam (816), i.e. countrycode(c(260,816), "cown", "country.name") outputs (NA,NA)
These two countries are tricky, because CoW assigns different numerical identifiers depending on the year.
In its cross-sectional incarnation, countrycode must use a one-to-one mapping, which forces us to choose one and only one cown code per country. Currently, we have:
library(countrycode)
countrycode('Germany', 'country.name', 'cown')
[1] 255
countrycode('Vietnam', 'country.name', 'cown')
[1] 817
Are these suboptimal, you think?
If you are working with panel data, a better option is to use the country-year dataset that is packaged with countrycode: countrycode::codelist_panel
Another alternative is to use the custom_match argument:
> library(countrycode)
> countrycode(c(816, 817, 255, 260), 'cown', 'country.name', custom_match = c('816' = 'Vietnam', '260' = 'Germany'))
[1] "Vietnam" "Vietnam" "Germany" "Germany"
I see your point going from names to COW codes, but why:
countrycode(c(260,816), "cown", "country.name")
[1] NA NA
This should be
[1] German Federal Republic, Vietnam
according to COW system membership file.
I don't understand what you mean. Can you show me in both directions what you want? And please add iso3c for reference.
For instance:
Germany -> 255
Germany -> DEU
German Federal Republic -> DEU
German Federal Republic -> 260
DEU -> Germany
260 -> German Federal Republic
255 -> Germany
Obviously, this breaks the one-to-one mapping condition...
According to COW http://www.correlatesofwar.org/data-sets/state-system-membership 260 = "German Federal Republic" and 255 = "Germany". I would expect that countrycode(260, "cown", "country.name") always outputs "German Federal Republic" since that name is assigned to the code 260 independent of any year. I understand that the output for countrycode("German Federal Republic", "country.name", "cown") depends on the year.
The way countrycode works, we need the dictionary to work in BOTH DIRECTIONS. There cannot be duplicate entries, or asymmetric mappings. This is why we need a strict one-to-one map. There cannot be any one-to-two, or two-to-one.
Since the cown of "German Federal Republic" depends on the year, we need to choose either 260 or 255 in order to preserve that one-to-one symmetric mapping. To some extent, the choice is arbitrary. Currently, countrycode chose 255, which is why 260 doesn't produce anything.
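To make the constraint concrete, here is a minimal base R sketch. It does not use the real countrycode dictionary or its internals: `toy`, `strict`, and the `convert()` helper are hypothetical names invented for illustration, and the helper only mimics the idea that a value matching more than one row is set to NA.

```r
# Toy dictionary: NOT the real countrycode data, just two CoW codes
# that both point at the same country name.
toy <- data.frame(
  country.name = c("Germany", "Germany"),
  cown         = c(255, 260),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up each value of `x` in column `origin` and
# return the matching value from column `destination`.
# Zero matches or more than one distinct match gives NA.
convert <- function(x, origin, destination, dict) {
  vapply(x, function(value) {
    hits <- unique(dict[[destination]][dict[[origin]] == value])
    if (length(hits) == 1) as.character(hits) else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Code -> name works for both codes...
convert(c(255, 260), "cown", "country.name", toy)
# ...but name -> code is ambiguous, because two codes are candidates.
convert("Germany", "country.name", "cown", toy)

# Dropping the 260 row restores a one-to-one map, at the cost of losing 260.
strict <- toy[toy$cown != 260, ]
convert("Germany", "country.name", "cown", strict)
convert(260, "cown", "country.name", strict)
```

In this toy setup the two-row dictionary can only be used in one direction, while the one-row dictionary works both ways but has no answer for 260.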
The proper way to deal with this issue is to use a panel conversion dictionary instead of a cross-sectional one. This is why countrycode ships with the codelist_panel data.frame.
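The idea behind a panel dictionary can be sketched in base R without the package. The membership rows below are the CoW state system membership entries for the German states (years only); `membership`, `toy_panel`, and `obs` are illustrative names, and this toy table is an assumption-laden stand-in for codelist_panel, not a copy of it.

```r
# CoW state system membership spells for the German states (years only).
membership <- data.frame(
  ccode    = c(255, 255, 260, 265),
  statenme = c("Germany", "Germany",
               "German Federal Republic", "German Democratic Republic"),
  styear   = c(1816, 1990, 1955, 1954),
  endyear  = c(1945, 2016, 1990, 1990),
  stringsAsFactors = FALSE
)

# Expand each spell into one row per year: a toy country-year dictionary.
toy_panel <- do.call(rbind, lapply(seq_len(nrow(membership)), function(i) {
  data.frame(
    cown     = membership$ccode[i],
    cow.name = membership$statenme[i],
    year     = membership$styear[i]:membership$endyear[i],
    stringsAsFactors = FALSE
  )
}))

# Observations keyed by code and year, as in a typical panel data set.
obs <- data.frame(cown = c(260, 265, 255), year = c(1985, 1985, 2000), var = 1:3)

# The (cown, year) pair is the key, so 260 and 255 no longer compete
# for a single dictionary row.
merged <- merge(obs, toy_panel, by = c("cown", "year"), all.x = TRUE)
merged[order(merged$var), ]
```

Because the key includes the year, each code keeps its own label here; the price is that every lookup needs a year column.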
I think the problem mainly arises from the ambiguous name "Federal Republic of Germany" and its ambiguous variations. If we agreed to always refer to the former "Federal Republic of Germany" as "West Germany" only, in regex and country.name.en etc., then it might be workable. We do convert "East Germany" to 265.
Right. But that's a problem with CoW's coding, which we can't do anything about.
stateabb | ccode | statenme | styear | stmonth | stday | endyear | endmonth | endday | version
---|---|---|---|---|---|---|---|---|---
GMY | 255 | Germany | 1816 | 1 | 1 | 1945 | 5 | 8 | 2016
GMY | 255 | Germany | 1990 | 10 | 3 | 2016 | 12 | 31 | 2016
GFR | 260 | German Federal Republic | 1955 | 5 | 5 | 1990 | 10 | 2 | 2016
GDR | 265 | German Democratic Republic | 1954 | 3 | 25 | 1990 | 10 | 2 | 2016
as far as I can tell, these are always true in CoW...
cowc: GFR == cown: 260 == cow.name: "German Federal Republic" == "West Germany"
cowc: GMY == cown: 255 == cow.name: "Germany" == "Germany" (as in current Germany, and pre-1946 Germany)
so these could always work, not conflict, and always be directly reversible...
cown: 260 -> "West Germany"
"West Germany" -> cown: 260
that only becomes a problem if we want to allow something like (as some coding schemes do)...
"West Germany" -> iso3c: DEU
"Germany" -> iso3c: DEU
because then when you try to reverse, you'll have two matches
if we had...
country.name.en | country.name.en.regex | cown | cowc | cow.name | iso3c | every other column |
---|---|---|---|---|---|---|
West Germany | "West Germany" | 260 | GFR | German Federal Republic | NA | NA |
Germany | [current regex for Germany (assuming it doesn't match "West Germany")] | 255 | GMY | Germany | DEU | [as it currently is] |
it seems like it should work if we don't allow "West Germany" to be converted into anything else that's not definitively, unambiguously, exclusively equivalent to "West Germany"
That's exactly the problem. And it's not just iso3c, it's "all other destinations". My sense is that CoW codes are popular, but that they are definitely a minority use case relative to "West Germany" -> All other codes. I think we want to keep the latter, even if it costs us the former.
But of course, I don't have any systematic data to back that up. Just my own practice.
got it... so to clarify a bit, doing this would screw up some of the other (possibly more common) codes which prefer to essentially view "West Germany" and "current Germany" as the same thing
Yes.
Do you share the intuition that those use-cases are more common?
in my research yes, but I think it's a shame because CoW is one of very few robust country code schemes one can use for time series that go back ~20+ years
btw... there is a row in the current codelist for "Federal Republic of Germany" separate from the "Germany" row, but it is always NA except for country.name.en, country.name.en.regex, country.name.de, country.name.de.regex. Based on this discussion, that should probably not be there.
Good catch. I removed that line in this commit: https://github.com/vincentarelbundock/countrycode/commit/51f4a6f1d3ffc841f387b3bdbfe71b8029fb79eb
That whole discussion reinforces my belief that mostly, what we should be using is codelist_panel. The build process for that is pretty hackish, and I'm sure we could improve the substantive choices made, but that seems like a more robust avenue.
How is codelist_panel intended to be used?
This is somewhat convoluted, and the result isn't great either?
library(tibble)
library(dplyr)
library(countrycode)
df <- tribble(
~country, ~year, ~var,
260, 1985, 1,
265, 1985, 2,
255, 2000, 3
)
df %>%
left_join(codelist_panel[, c("cown", "country.name.en", "year")],
by = c("country" = "cown", "year" = "year"))
# # A tibble: 3 x 4
# country year var country.name.en
# <dbl> <dbl> <dbl> <chr>
# 1 260 1985 1 Germany
# 2 265 1985 2 German Democratic Republic
# 3 255 2000 3 Germany
I think it would be good to think about...
- a good example of how to use it
- a better way to use it
The one-to-one matching logic seems to be a major departure from the previous package version as I recall it. May I ask, what's the reason for this departure?
No departure. It was always like that.
But in previous package versions I never got NA for the call countrycode(260, "cown", "country.name"). Why now? Up to version 0.19 from February this year, I get "Federal Republic of Germany", but now with version 1.0.0 it's NA.
@cjyetman I'm not sure why you find the left_join cumbersome. My current workflow looks like this:
> # Load
> library(tidyverse)
> library(countrycode)
>
> # Simulate data
> x <- tribble(
+ ~cown, ~year, ~var1,
+ 255, 1985, 1,
+ 255, 2000, 3,
+ 2, 1985, 2
+ )
> y <- tribble(
+ ~iso3c, ~year, ~var2,
+ 'DEU', 1985, 1,
+ 'DEU', 2000, 5,
+ 'USA', 1985, 2,
+ 'DZA', 2000, 3
+ )
>
> # Left merge into panel data
> panel <- countrycode::codelist_panel %>%
+ select(iso3c, cown, year)
> panel <- purrr::reduce(list(panel, x, y), left_join)
Joining, by = c("cown", "year")
Joining, by = c("iso3c", "year")
>
> # Check the result
> panel %>% filter(!is.na(var1) | !is.na(var2))
iso3c cown year var1 var2
1 DZA 615 2000 NA 3
2 DEU 260 1985 NA 1
3 DEU 255 2000 3 5
4 USA 2 1985 2 2
It just seems a bit awkward, especially for a package that generally makes things incredibly easy. I don't have a solution in mind, but I might start thinking of one.
Sounds good. Let me know if you have some ideas. I'm very interested.
With respect to the current issue, do you think we should include entries in dictionary_static with empty regex but values for cown and country.name.en? Those entries would be functionality-constrained, but it would solve this user's specific problem.
The important thing would be to ensure none of the entries duplicate what's in other rows (but our test suite should catch that).
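A check of that kind could look roughly like the base R sketch below. It is not the package's actual test suite: `has_unique_codes()`, `toy`, and `toy_bad` are hypothetical names, and the two-row table only illustrates an extra entry that carries cown and country.name.en but no other codes.

```r
# Hypothetical check: within each column, every non-missing value
# must appear at most once.
has_unique_codes <- function(dict, columns) {
  vapply(columns, function(col) {
    values <- dict[[col]][!is.na(dict[[col]])]
    anyDuplicated(values) == 0
  }, logical(1))
}

# Toy dictionary with a constrained extra row: it has a name and a cown,
# but NA for every other code (here only iso3c).
toy <- data.frame(
  country.name.en = c("Germany", "German Federal Republic"),
  cown            = c(255, 260),
  iso3c           = c("DEU", NA),
  stringsAsFactors = FALSE
)
has_unique_codes(toy, c("country.name.en", "cown", "iso3c"))

# If the extra row also claimed DEU, the iso3c column would be flagged.
toy_bad <- toy
toy_bad$iso3c[2] <- "DEU"
has_unique_codes(toy_bad, c("country.name.en", "cown", "iso3c"))
```

Missing values are ignored on purpose, since the constrained rows are expected to be NA in most columns.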
@sumtxt I suppose it could be considered a bug...
devtools::install_github("vincentarelbundock/countrycode", ref = "0.19")
library(countrycode)
countrycode(260, "cown", "country.name")
# [1] "Federal Republic of Germany"
countrycode("Federal Republic of Germany", "country.name", "cown")
# [1] NA
# Warning messages:
# 1: In countrycode("Federal Republic of Germany", "country.name", "cown") :
# Some values were not matched unambiguously: Federal Republic of Germany
#
# 2: In countrycode("Federal Republic of Germany", "country.name", "cown") :
# Some strings were matched more than once, and therefore set to <NA> in the result: Federal Republic of Germany,260,255
arrgh, that can't really work, since now we're merging the dictionary based on unique regexes
Edit: And we want the regex for "Federal Republic of Germany" to map onto the other code.
fyi... signing off for a bit... I'll come back to this in the coming days
Thanks again for opening this issue. I still recognize that this is a problem, but unfortunately this problem can only be satisfactorily addressed by a fundamental re-write of countrycode, to allow "many-to-one" mappings.
We already have a (dormant but open) issue for this, which you can follow if there is interest: https://github.com/vincentarelbundock/countrycode/issues/186
I have also noted the problem in a new "Frequently Requested Feature" issue, so I will close this one here. I'm only closing to avoid duplication and to clean up the repo. If Many-to-One is implemented, the specific case of Germany CoW codes will automatically be solved.
Thanks for your patience!