dataprep clean: `clean_country` cannot recognize some country names

Describe the bug: Some countries, such as Scotland and Yugoslavia, cannot be recognized by clean_country.

To reproduce from this dataset:

Currently, "Scotland", "England", and "Wales" cannot be recognized by dataprep.clean. Would it be more appropriate to transform them into "United Kingdom", or to keep them as they are?

In addition, some countries that existed in the past cannot be recognized, such as Yugoslavia and the Republic of Vietnam (South Vietnam). I think it would be a good idea to include them and add a "(former)" label, just as we do for the Soviet Union:

Apr 27 '21 07:04 NoirTree

Great job, @NoirTree! Hi Ryan (@ryanwdale), does our implementation handle country names from the past? Is it necessary to take these past country names into consideration? Could you share your opinion? Thanks!

Apr 27 '21 08:04 qidanrui

Thanks for the detailed issue.

> Currently, "Scotland", "England", and "Wales" cannot be recognized by dataprep.clean. Would it be more appropriate to transform them into "United Kingdom", or to keep them as they are?

I would lean towards leaving this as is. Some users may not want this behaviour, and to be consistent we might have to do something similar in other cases, which adds complexity.

> some countries that existed in the past cannot be recognized

There are already a number of countries from the past that are supported, so it would be good to add more. I would prefer not to label them as (former), though. We could also eventually add an include_obsolete parameter that lets users specify whether they want to include countries from the past. Are you interested in working on this, @NoirTree? It should only require adding the desired countries to this file: https://github.com/sfu-db/dataprep/blob/develop/dataprep/clean/country_data.tsv
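To make the include_obsolete idea concrete, here is a minimal sketch of how such a flag could filter a country table. It is only an illustration: the `obsolete` column, the `COUNTRY_TSV` sample, and the `load_countries` helper are hypothetical and do not reflect the actual layout of country_data.tsv or any existing dataprep function.

```python
import csv
import io

# Hypothetical excerpt in the spirit of a country table. The real
# country_data.tsv has different columns; "obsolete" is an assumption.
COUNTRY_TSV = (
    "name\tobsolete\n"
    "France\tno\n"
    "Soviet Union\tyes\n"
    "Yugoslavia\tyes\n"
)


def load_countries(tsv_text, include_obsolete=True):
    """Return country names, optionally dropping rows marked obsolete."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["name"]
        for row in reader
        if include_obsolete or row["obsolete"] != "yes"
    ]


all_names = load_countries(COUNTRY_TSV)
current_names = load_countries(COUNTRY_TSV, include_obsolete=False)
print(all_names)
print(current_names)
```

With a flag like this, past countries stay in the data file without a "(former)" label, and the caller decides whether they take part in matching.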

Apr 27 '21 21:04 ryanwdale

We're also planning on adding a feature that allows users to pass a file containing country data to the clean_country() and validate_country() functions. This would allow the user to have control over which countries they clean and the regexes used to match the countries.
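As a rough sketch of what a user-supplied country file could enable, the snippet below loads name/regex pairs from tab-separated text and uses them to clean values. The file layout, the `USER_TSV` sample, and the `load_patterns` and `clean_value` helpers are assumptions made for illustration, not the planned dataprep interface.

```python
import csv
import io
import re

# Hypothetical user-supplied data: one country per row plus the regex
# used to match it. The column names are assumptions for this sketch.
USER_TSV = (
    "name\tregex\n"
    "United Kingdom\t^(united kingdom|england|scotland|wales)$\n"
    "Yugoslavia\t^yugoslavia$\n"
)


def load_patterns(tsv_text):
    """Parse the TSV text into (country name, compiled regex) pairs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["name"], re.compile(row["regex"], re.IGNORECASE))
        for row in reader
    ]


def clean_value(value, patterns):
    """Return the first matching country name, or None if nothing matches."""
    text = value.strip()
    for name, pattern in patterns:
        if pattern.search(text):
            return name
    return None


patterns = load_patterns(USER_TSV)
print(clean_value("Scotland", patterns))
print(clean_value("heyyyy", patterns))
```

Here the user, not the library, decides that "Scotland" should map to "United Kingdom", which keeps that choice out of the default data.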

Apr 28 '21 00:04 ryanwdale

Thanks for your reply, @ryanwdale!

The idea of include_obsolete is really good, and I will add those countries to the file soon. I'm also wondering whether we could find a more complete country list (like this one, which contains a lot of former countries) and fill in more of the missing entries at once.

> We're also planning on adding a feature that allows users to pass a file containing country data to the clean_country() and validate_country() functions. This would allow the user to have control over which countries they clean and the regexes used to match the countries.

I think allowing users to supply their own regexes would be a good feature, but it would be hard for them to provide a more complete country list than ours. Maybe we can return the most complete results we can obtain and let users decide which parts they do not want. (For example, clean_country could transform "England" to "United Kingdom" by default, and the user could change this behavior by setting some parameters.)

Apr 29 '21 09:04 NoirTree

The list of countries looks good; it could definitely be helpful.

Parameters would be nice if they make small changes easier, though they might get a bit complicated. I was thinking that in most cases the user would just copy our file and make whatever changes they want. We could maybe do both if there's a clean way of adding a parameter for this.

Apr 29 '21 19:04 ryanwdale

Hey, @ryanwdale! Here is my idea in detail:

Since it's hard for us to include every country in the world, and users may sometimes have preferences about how to handle a certain country, we can just leave the choice to them.

To make it more user-friendly, we could design a GUI like the one in clean_duplication. Here is an example:

For countries that clean_country cannot recognize, we can provide a GUI that looks like this:

| Value | Appear Times | Valid Country? | Cleaned Value |
|---|---|---|---|
| Yugoslavia | 1 | o yes o no | |
| England | 35 | o yes o no | |
| heyyyy | 1 | o yes o no | |
| Serbia and Montenegro | 2 | o yes o no | |

Then the user can make some choices:

| Value | Appear Times | Valid Country? | Cleaned Value |
|---|---|---|---|
| Yugoslavia | 1 | √yes o no # valid country but not included | Yugoslavia |
| England | 35 | √yes o no # leave the choice to the user | United Kingdom |
| heyyyy | 1 | o yes √no # invalid country | |
| Serbia and Montenegro | 2 | o yes √no # user does not desire it | |

This would save us considerable time searching for a complete country list, and it leaves room for user preferences.
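The data behind such a GUI can be sketched in a few lines: count the values that were not recognized, then turn the user's yes/no choices into a replacement mapping. The names `review_rows` and `decisions_to_mapping`, and the tiny `known` set, are hypothetical and only illustrate the idea.

```python
from collections import Counter


def review_rows(values, known):
    """Return (value, count) rows for unrecognized values, in first-seen order."""
    counts = Counter(v for v in values if v not in known)
    return list(counts.items())


def decisions_to_mapping(decisions):
    """Convert (value, is_valid, cleaned_value) choices into a mapping.

    Values the user marked as invalid map to None.
    """
    return {
        value: (cleaned if is_valid else None)
        for value, is_valid, cleaned in decisions
    }


# Stand-in for the set of names the cleaner already recognizes.
known = {"France"}
column = (
    ["Yugoslavia"]
    + ["England"] * 35
    + ["heyyyy", "France"]
    + ["Serbia and Montenegro"] * 2
)

rows = review_rows(column, known)

# The choices a user might make in the GUI shown above.
choices = [
    ("Yugoslavia", True, "Yugoslavia"),
    ("England", True, "United Kingdom"),
    ("heyyyy", False, None),
    ("Serbia and Montenegro", False, None),
]
mapping = decisions_to_mapping(choices)
print(rows)
print(mapping)
```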

Apr 30 '21 03:04 NoirTree

Good idea, @NoirTree! This is similar to faceting in OpenRefine: https://docs.openrefine.org/manual/facets and https://librarycarpentry.org/lc-open-refine/04-faceting-and-filtering/index.html

I think this function is useful not only for clean_country but also for other clean functions (e.g., clean_date). We can consider creating a new clean_facet() function.

Apr 30 '21 04:04 jnwang

Thanks for the resources, @jnwang !

I think the idea of text clustering and mass editing in OpenRefine is really useful. That is where we can combine clean_duplication with other clean functions. But maybe it would be even more useful to embed this functionality in each clean function instead of creating a new one? For example, the text clustering algorithm can be adapted to the semantic data type (e.g. address, country), which may work even better than a general-purpose one (e.g. fingerprint in clean_duplication).

Apr 30 '21 08:04 NoirTree

This is a good point and it’s worth exploring. Since adding clean_duplication to every function will require big changes to the code base, let’s go with the following simple solution at this point.

It looks like what we need is to put identical values into one cluster and allow the user to edit them all together. This naive clustering method hasn’t been added to clean_duplication yet, and we can consider adding it.

Once this is added, the user can first call clean_country and then apply clean_duplication to the unrecognized countries.
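The naive clustering described above can be sketched as follows: group the row positions of identical values, then apply one edit to a whole cluster. This is a plain-Python illustration with hypothetical helper names (`cluster_identical`, `edit_cluster`); it is not the clean_duplication implementation.

```python
from collections import defaultdict


def cluster_identical(values):
    """Group row indices by exact value, one cluster per distinct value."""
    clusters = defaultdict(list)
    for index, value in enumerate(values):
        clusters[value].append(index)
    return dict(clusters)


def edit_cluster(values, clusters, old, new):
    """Return a copy of values with every row in cluster `old` set to `new`."""
    result = list(values)
    for index in clusters.get(old, []):
        result[index] = new
    return result


# Values left unrecognized after a first cleaning pass (sample data).
unrecognized = ["England", "Scotland", "England", "Wales"]
clusters = cluster_identical(unrecognized)
edited = edit_cluster(unrecognized, clusters, "England", "United Kingdom")
print(clusters)
print(edited)
```

Because each cluster holds the row positions, a single user decision updates every occurrence at once.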

Apr 30 '21 15:04 jnwang

I totally agree with you, @jnwang ! The solution is quite useful and feasible at the same time. Thanks for the great idea!

May 01 '21 02:05 NoirTree

A few other values, mostly nationalities, are also not handled properly by dataprep:

'French', 'Frence', 'English', 'Chinese', 'American', 'Thai'

Jul 23 '21 16:07 arky

Hey @arky, thanks for your comments!

I tested them on my machine and found similar problems. However, "Frence" can be handled by setting fuzzy_dist to 1 or higher.
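To see why a distance of 1 is enough for "Frence", here is a small sketch based on plain Levenshtein edit distance. It only illustrates the idea of a distance threshold; it is not how dataprep implements fuzzy_dist, and the `edit_distance` and `fuzzy_match` helpers are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def fuzzy_match(value, names, max_dist):
    """Return the names within max_dist edits of value, ignoring case."""
    return [
        name
        for name in names
        if edit_distance(value.lower(), name.lower()) <= max_dist
    ]


names = ["France", "Thailand", "China"]
print(fuzzy_match("Frence", names, max_dist=0))
print(fuzzy_match("Frence", names, max_dist=1))
```

"Frence" differs from "France" by a single substituted letter, so it only matches once the allowed distance is at least 1.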

For the rest, since they are not country names (a country name should be a noun), clean_country cannot handle them properly even with fuzzy matching. Instead, they could be processed by the clean_language function (coming soon), which may come in handy. In this case, though, I guess you want to transform a language into a country. This can be achieved by transforming "name" to "alpha-2" using clean_language, then "alpha-2" to "name" using clean_country.

Jul 24 '21 03:07 NoirTree
