COW codes Germany/Vietnam

Open sumtxt opened this issue 6 years ago • 26 comments

countrycode version 1.0.0 doesn't match the COW codes for Germany (260) and Vietnam (816), i.e., countrycode(c(260,816), "cown", "country.name") outputs (NA,NA)

sumtxt avatar May 31 '18 15:05 sumtxt

These two countries are tricky, because CoW assigns different numerical identifiers depending on the year.

In its cross-sectional incarnation, countrycode must use a one-to-one mapping, which forces us to choose one and only one cown code per country. Currently, we have:

library(countrycode)
countrycode('Germany', 'country.name', 'cown')
[1] 255
countrycode('Vietnam', 'country.name', 'cown')
[1] 817

Are these suboptimal, you think?

If you are looking at panel data, a better option would be to use the country year dataset that is packaged with countrycode: countrycode::codelist_panel

vincentarelbundock avatar May 31 '18 15:05 vincentarelbundock

Another alternative is to use the custom_match argument:


> library(countrycode)
> countrycode(c(816, 817, 255, 260), 'cown', 'country.name', custom_match = c('816' = 'Vietnam', '260' = 'Germany'))
[1] "Vietnam" "Vietnam" "Germany" "Germany"
>

vincentarelbundock avatar May 31 '18 15:05 vincentarelbundock

I see your point going from names to COW codes, but why:

countrycode(c(260,816), "cown", "country.name")
[1] NA NA

This should be

[1] German Federal Republic, Vietnam

according to COW system membership file.

sumtxt avatar May 31 '18 15:05 sumtxt

I don't understand what you mean. Can you show me in both directions what you want? And please add iso3c for reference.

For instance:

Germany -> 255
Germany -> DEU
German Federal Republic -> DEU
German Federal Republic -> 260
DEU -> Germany
260 -> German Federal Republic
255 -> Germany

Obviously, this breaks the one-to-one mapping condition...

vincentarelbundock avatar May 31 '18 15:05 vincentarelbundock

According to COW http://www.correlatesofwar.org/data-sets/state-system-membership 260 = "German Federal Republic" and 255 = "Germany". I would expect that countrycode(260, "cown", "country.name") always outputs "German Federal Republic" since that name is assigned to the code 260 independent of any year. I understand that the output for countrycode("German Federal Republic", "country.name", "cown") depends on the year.

sumtxt avatar May 31 '18 16:05 sumtxt

The way countrycode works, we need the dictionary to work in BOTH DIRECTIONS. There cannot be duplicate entries, or asymmetric mappings. This is why we need a strict one-to-one map. There cannot be any one-to-two, or two-to-one.

Since the cown of "German Federal Republic" depends on the year, we need to choose either 260 or 255 in order to preserve that one-to-one symmetric mapping. To some extent, the choice is arbitrary. Currently, countrycode chose 255, which is why 260 doesn't produce anything.

The proper way to deal with this issue is to use a panel conversion dictionary instead of a cross-sectional one. This is why countrycode ships with the codelist_panel data.frame.
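
To make the constraint concrete, here is a minimal base-R sketch of a two-way lookup. The toy dictionary, the convert() helper, and the dict2 variant are hypothetical illustrations, not countrycode's actual internals; only the code values (255, 260, 817) come from this thread.

```r
# Toy dictionary (hypothetical; NOT countrycode's real codelist).
dict <- data.frame(
  country.name = c("Germany", "Vietnam"),
  iso3c        = c("DEU", "VNM"),
  cown         = c(255, 817),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up x in the origin column and
# return the value from the destination column of the same row.
convert <- function(x, origin, destination, dictionary = dict) {
  dictionary[[destination]][match(x, dictionary[[origin]])]
}

# 260 has no row of its own, so it comes back as NA.
convert(c(255, 260), "cown", "country.name")

# Adding a second row that shares iso3c "DEU" makes 260 resolvable...
dict2 <- rbind(
  dict,
  data.frame(country.name = "German Federal Republic",
             iso3c = "DEU", cown = 260,
             stringsAsFactors = FALSE)
)
convert(260, "cown", "country.name", dict2)

# ...but the reverse direction is now ambiguous: "DEU" appears twice,
# and match() silently returns only the first hit.
anyDuplicated(dict2$iso3c) > 0
convert("DEU", "iso3c", "cown", dict2)
```

In this sketch the duplicate check is what a one-to-one dictionary has to pass for every column, which is why only one of 255 and 260 can be kept.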

vincentarelbundock avatar May 31 '18 16:05 vincentarelbundock

I think the problem mainly arises from the ambiguous name "Federal Republic of Germany" and its ambiguous variations. If we agreed to always refer to the former "Federal Republic of Germany" as "West Germany" only, in regex and country.name.en etc., then it might be workable. We do convert "East Germany" to 265.

cjyetman avatar May 31 '18 16:05 cjyetman

Right. But that's a problem with CoW's coding, which we can't do anything about.


stateabb | ccode | statenme | styear | stmonth | stday | endyear | endmonth | endday | version
GMY | 255 | Germany | 1816 | 1 | 1 | 1945 | 5 | 8 | 2016
GMY | 255 | Germany | 1990 | 10 | 3 | 2016 | 12 | 31 | 2016
GFR | 260 | German Federal Republic | 1955 | 5 | 5 | 1990 | 10 | 2 | 2016
GDR | 265 | German Democratic Republic | 1954 | 3 | 25 | 1990 | 10 | 2 | 2016
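
As a rough illustration of why the year resolves the ambiguity, the rows above can be typed into a small data frame and queried by code and year. This is a base-R sketch only; cow_name() is a hypothetical helper, not part of countrycode, and it compares whole years rather than exact dates.

```r
# The four rows from the CoW extract above, keeping only the columns we need.
cow <- data.frame(
  ccode    = c(255, 255, 260, 265),
  statenme = c("Germany", "Germany",
               "German Federal Republic", "German Democratic Republic"),
  styear   = c(1816, 1990, 1955, 1954),
  endyear  = c(1945, 2016, 1990, 1990),
  stringsAsFactors = FALSE
)

# Hypothetical helper: name of the state that held `code` in `year`,
# or NA if that code was not a system member in that year.
cow_name <- function(code, year, membership = cow) {
  hit <- membership$ccode == code &
    membership$styear <= year &
    membership$endyear >= year
  if (any(hit)) membership$statenme[hit][1] else NA_character_
}

cow_name(260, 1985)  # "German Federal Republic"
cow_name(255, 1985)  # NA: code 255 is not active between 1945 and 1990
cow_name(255, 2000)  # "Germany"
```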

vincentarelbundock avatar May 31 '18 17:05 vincentarelbundock

as far as I can tell, these are always true in CoW...
cowc: GFR == cown: 260 == cow.name: "German Federal Republic" == "West Germany"
cowc: GMY == cown: 255 == cow.name: "Germany" == "Germany" (as in current Germany, and pre-1946 Germany)

so these could always work, not conflict, and always be directly reversible...
cown: 260 -> "West Germany"
"West Germany" -> cown: 260

that only becomes a problem if we want to allow something like (as some coding schemes do)...
"West Germany" -> iso3c: DEU
"Germany" -> iso3c: DEU
because then when you try to reverse, you'll have two matches

if we had...

country.name.en | country.name.en.regex | cown | cowc | cow.name | iso3c | every other column
West Germany | "West Germany" | 260 | GFR | German Federal Republic | NA | NA
Germany | [current regex for Germany (assuming it doesn't match "West Germany")] | 255 | GMY | Germany | DEU | [as it currently is]

it seems like it should work if we don't allow "West Germany" to be converted into anything else that's not definitively, unambiguously, exclusively equivalent to "West Germany"

cjyetman avatar May 31 '18 17:05 cjyetman

That's exactly the problem. And it's not just iso3c, it's "all other destinations". My sense is that CoW codes are popular, but that they are definitely a minority use case relative to "West Germany" -> All other codes. I think we want to keep the latter, even if it costs us the former.

vincentarelbundock avatar May 31 '18 18:05 vincentarelbundock

But of course, I don't have any systematic data to back that up. Just my own practice.

vincentarelbundock avatar May 31 '18 18:05 vincentarelbundock

got it... so to clarify a bit, doing this would screw up some of the other (possibly more common) codes which prefer to essentially view "West Germany" and "current Germany" as the same thing

cjyetman avatar May 31 '18 18:05 cjyetman

Yes.

Do you share the intuition that those use-cases are more common?

vincentarelbundock avatar May 31 '18 18:05 vincentarelbundock

in my research yes, but I think it's a shame because CoW is one of very few robust country code schemes one can use for time series that go back ~20+ years

cjyetman avatar May 31 '18 18:05 cjyetman

btw... there is a row in the current codelist for "Federal Republic of Germany" separate from the "Germany" row, but it is always NA except for country.name.en, country.name.en.regex, country.name.de, country.name.de.regex. Based on this discussion, that should probably not be there.

cjyetman avatar May 31 '18 18:05 cjyetman

Good catch. I removed that line in this commit: https://github.com/vincentarelbundock/countrycode/commit/51f4a6f1d3ffc841f387b3bdbfe71b8029fb79eb

That whole discussion reinforces my belief that mostly, what we should be using is codelist_panel. The build process for that is pretty hackish, and I'm sure we could improve the substantive choices made, but that seems like a more robust avenue.

vincentarelbundock avatar May 31 '18 18:05 vincentarelbundock

How is codelist_panel intended to be used?

This is somewhat convoluted, and the result isn't great either?

library(tibble)
library(dplyr)
library(countrycode)

df <- tribble(
  ~country, ~year, ~var,
  260,      1985,  1,
  265,      1985,  2,
  255,      2000,  3
)

df %>% 
  left_join(codelist_panel[, c("cown", "country.name.en", "year")], 
            by = c("country" = "cown", "year" = "year"))

# # A tibble: 3 x 4
#   country  year   var country.name.en           
#     <dbl> <dbl> <dbl> <chr>                     
# 1     260  1985     1 Germany                   
# 2     265  1985     2 German Democratic Republic
# 3     255  2000     3 Germany     

I think it would be good to think about...

  1. a good example of how to use it
  2. a better way to use it

cjyetman avatar May 31 '18 19:05 cjyetman

The one-to-one matching logic seems to be a major departure from the previous package version as I recall it. May I ask, what's the reason for this departure?

sumtxt avatar May 31 '18 19:05 sumtxt

No departure. It was always like that.

vincentarelbundock avatar May 31 '18 19:05 vincentarelbundock

But in previous package versions I never got "NA" for a call countrycode(260, "cown", "country.name"). Why now? Up to version 0.19 from February this year, I get "Federal Republic of Germany" but now with version 1.0.0 it's NA.

sumtxt avatar May 31 '18 19:05 sumtxt

@cjyetman

I'm not sure why you find the left_join cumbersome. My current workflow looks like this:

> # Load
> library(tidyverse)
> library(countrycode)
>
> # Simulate data
> x <- tribble(
+ ~cown, ~year, ~var1,
+ 255,      1985,  1,
+ 255,      2000,  3,
+ 2,      1985,  2
+ )
> y <- tribble(
+ ~iso3c, ~year, ~var2,
+ 'DEU',      1985,  1,
+ 'DEU',      2000,  5,
+ 'USA',      1985,  2,
+ 'DZA',      2000,  3
+ )
>
> # Left merge into panel data
> panel <- countrycode::codelist_panel %>%
+          select(iso3c, cown, year)
> panel <- purrr::reduce(list(panel, x, y), left_join)
Joining, by = c("cown", "year")
Joining, by = c("iso3c", "year")
>
> # Check the result
> panel %>% filter(!is.na(var1) | !is.na(var2))
  iso3c cown year var1 var2
1   DZA  615 2000   NA    3
2   DEU  260 1985   NA    1
3   DEU  255 2000    3    5
4   USA    2 1985    2    2

vincentarelbundock avatar May 31 '18 19:05 vincentarelbundock

It just seems a bit awkward, especially for a package that generally makes things incredibly easy. I don't have a solution in mind, but I might start thinking of one.
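
One possible shape for a lighter interface, sketched in base R under stated assumptions: convert_panel() is a hypothetical helper that does not exist in countrycode, and toy_panel is a four-row stand-in built from the country-year values printed earlier in this thread, not the real codelist_panel.

```r
# Toy country-year panel (stand-in for countrycode::codelist_panel).
toy_panel <- data.frame(
  iso3c = c("DZA", "DEU", "DEU", "USA"),
  cown  = c(615, 260, 255, 2),
  year  = c(2000, 1985, 2000, 1985),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert codes using (code, year) pairs as the key,
# so the same country can carry different codes in different years.
convert_panel <- function(code, year, origin, destination, panel = toy_panel) {
  key_panel <- paste(panel[[origin]], panel[["year"]])
  key_input <- paste(code, year)
  panel[[destination]][match(key_input, key_panel)]
}

convert_panel(c(260, 255), c(1985, 2000), "cown", "iso3c")
convert_panel(c("DEU", "DEU"), c(1985, 2000), "iso3c", "cown")
```

Because the key is the code-year pair, both directions stay unambiguous in this toy panel, at the cost of the caller always having to supply a year.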

cjyetman avatar May 31 '18 19:05 cjyetman

Sounds good. Let me know if you have some ideas. I'm very interested.

w.r.t. the current issue, do you think we should include entries in dictionary_static with empty regex but values for cown and country.name.en? Those entries would be functionality-constrained, but it would solve this user's specific problem.

The important thing would be to ensure none of the entries duplicate what's in other rows (but our test suite should catch that).

vincentarelbundock avatar May 31 '18 19:05 vincentarelbundock

@sumtxt I suppose it could be considered a bug...

devtools::install_github("vincentarelbundock/countrycode", ref = "0.19")
library(countrycode)
countrycode(260, "cown", "country.name")
# [1] "Federal Republic of Germany"
countrycode("Federal Republic of Germany", "country.name", "cown")
# [1] NA
# Warning messages:
#   1: In countrycode("Federal Republic of Germany", "country.name", "cown") :
#   Some values were not matched unambiguously: Federal Republic of Germany
# 
# 2: In countrycode("Federal Republic of Germany", "country.name", "cown") :
#   Some strings were matched more than once, and therefore set to <NA> in the result: Federal Republic of Germany,260,255

cjyetman avatar May 31 '18 19:05 cjyetman

arrgh, that can't really work, since now we're merging the dictionary based on unique regexes

Edit: And we want the regex for "Federal Republic of Germany" to map onto the other code.

vincentarelbundock avatar May 31 '18 19:05 vincentarelbundock

fyi... signing off for a bit... I'll come back to this in the coming days

cjyetman avatar May 31 '18 19:05 cjyetman

Thanks again for opening this issue. I still recognize that this is a problem, but unfortunately this problem can only be satisfactorily addressed by a fundamental re-write of countrycode, to allow "many-to-one" mappings.

We already have a (dormant but open) issue for this, which you can follow if there is interest: https://github.com/vincentarelbundock/countrycode/issues/186

I have also noted the problem in a new "Frequently Requested Feature" issue, so I will close this one here. I'm only closing to avoid duplication and to clean up the repo. If Many-to-One is implemented, the specific case of Germany CoW codes will automatically be solved.

Thanks for your patience!

vincentarelbundock avatar Aug 24 '22 20:08 vincentarelbundock