
clean_country() does not recognize countries belonging to the UK as countries

Open FabianPalmaPando opened this issue 2 years ago • 3 comments

clean_country() applied to England and Scotland returns NaN. I believe this would happen for all countries belonging to the UK. It would be nice if the function recognized both cases, United Kingdom and England (for example), as different countries, depending on the input.

Thanks for creating such an amazing library! :)

FabianPalmaPando, Nov 18 '21 17:11

Hi! Thank you for your brilliant advice. You're right that we need to consider the details of different countries! Also, if you are interested, you are welcome to add whatever you like to country_data.tsv and open a PR!

qidanrui, Nov 18 '21 18:11

Hi, I just started looking at this project! It looks amazing @qidanrui! :)

Btw, this happens because the countries inside the UK are not ISO 3166-1 countries (list here (Wikipedia); you can see that Ireland is listed, but neither Northern Ireland nor England is). I saw a similar issue on the PHP repo umpirsky/country-list.

Maybe an option in clean_country() would be nice? Something like clean_country(include_non_iso=True or False, default False), which would use the data from country_data.tsv plus a new file country_non_iso_data.tsv (with the list of UK countries, and maybe more if there are any 🤔), since ISO is apparently the norm in all country lists and packages.
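
A rough sketch of how such an opt-in fallback could behave. This is not dataprep code: lookup_country, ISO_COUNTRIES and NON_ISO_COUNTRIES are hypothetical names, and the two dictionaries are tiny stand-ins for country_data.tsv and the proposed country_non_iso_data.tsv.

```python
from typing import Optional

# Hypothetical stand-in for country_data.tsv (ISO 3166-1 entries only).
ISO_COUNTRIES = {
    "united kingdom": "United Kingdom",
    "ireland": "Ireland",
}

# Hypothetical stand-in for the proposed country_non_iso_data.tsv.
NON_ISO_COUNTRIES = {
    "england": "England",
    "scotland": "Scotland",
    "wales": "Wales",
    "northern ireland": "Northern Ireland",
}


def lookup_country(name: str, include_non_iso: bool = False) -> Optional[str]:
    """Return the canonical name, or None when the input is not recognized.

    The ISO table is always consulted first; the non-ISO table is only
    used when the caller opts in, so the default behaviour is unchanged.
    """
    key = name.strip().lower()
    if key in ISO_COUNTRIES:
        return ISO_COUNTRIES[key]
    if include_non_iso and key in NON_ISO_COUNTRIES:
        return NON_ISO_COUNTRIES[key]
    return None


if __name__ == "__main__":
    for value in ["United Kingdom", "England", "Scotland"]:
        print(value, lookup_country(value), lookup_country(value, include_non_iso=True))
```

Keeping the default at False means existing callers would still get a missing value for England, and only code that asks for non-ISO names would see a change.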

moreaupascal56, Nov 25 '21 10:11

Btw, another issue is that these are not ISO countries but ISO "principal subdivisions of a country". Their ISO codes are attached to the UK's, like GB-ENG for England (https://en.wikipedia.org/wiki/ISO_3166-2:GB), so we don't have proper values for the alpha-2, alpha-3 and numeric columns (nor for regex, but there we can put the country name).

I saw that NaN values are no problem in country_data.tsv, but I guess the codes are expected to be strings of at most 2 or 3 characters.
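
To illustrate what rows without codes could look like, here is a small sketch that parses a tab-separated snippet with empty code fields. The column names (name, alpha-2, alpha-3, numeric) and the load_rows helper are assumptions for the example and may not match the real layout of country_data.tsv.

```python
import csv
import io

# Hypothetical TSV content: the UK row has ISO 3166-1 codes, while the
# England and Scotland rows leave all three code columns empty.
TSV = (
    "name\talpha-2\talpha-3\tnumeric\n"
    "United Kingdom\tGB\tGBR\t826\n"
    "England\t\t\t\n"
    "Scotland\t\t\t\n"
)


def load_rows(text: str) -> list:
    """Parse tab-separated text, turning empty fields into None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{key: (value or None) for key, value in row.items()} for row in reader]


rows = load_rows(TSV)

if __name__ == "__main__":
    for row in rows:
        print(row)
```

A subdivision code such as GB-ENG is six characters long, so it would not fit a column that expects 2 or 3 characters; leaving the field empty avoids that.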

moreaupascal56, Nov 25 '21 10:11