newsmap
Ambiguity in Chinese seed words for CF and MN
Hello, I found some issues in the Simplified Chinese dictionary. I list them here.
- 'CF': [中非共和国, 中非*, 班吉]. 中非 is a general term for Sino-African relations rather than a specific reference to the Central African Republic. The current version captures far too many CF matches because of this. I think it is better to omit '中非*'.
- 'MN': [蒙古*, 乌兰巴托]. '蒙古*' also captures the Inner Mongolia Autonomous Region when the user works with domestic newspapers. The current version captures far too many MN matches because of this. I believe it is better to use '蒙古国*' instead of '蒙古*'.
I am new to GitHub, so I am just posting it here. Thanks.
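To illustrate the difference between the two patterns, here is a minimal sketch in base R. The token vector is hypothetical (how a real tokenizer segments these strings is an assumption), and `glob2rx()` only approximates quanteda's glob matching.

```r
# Hypothetical tokens; real segmentation depends on the tokenizer
toks <- c("蒙古", "蒙古国", "蒙古族", "乌兰巴托")

# glob2rx() (utils package) converts a glob pattern into a regular expression
broad  <- grepl(utils::glob2rx("蒙古*"), toks)
narrow <- grepl(utils::glob2rx("蒙古国*"), toks)

toks[broad]   # "蒙古" "蒙古国" "蒙古族"
toks[narrow]  # "蒙古国"
```

The narrower pattern no longer matches the bare token "蒙古", so it avoids the Inner Mongolia matches but would also miss texts that refer to the country simply as 蒙古.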
Originally posted by @aseiiss in https://github.com/koheiw/newsmap/issues/6#issuecomment-1189857442
What do you think @chainsawriot?
@koheiw I am happy to change the dictionary, but I think the root problem is similar to the problem with so-called off-the-shelf dictionaries: all lexicons in text analysis are context-dependent, and one can't make lexicons that fit ALL contexts (shameless plug for my own article here). I think #28 is a similar issue.
I think a better solution is to provide functions that temporarily switch off certain words in a dictionary (a "negative dictionary", if you will) as a simple mechanism for context adjustment. Maybe we should also recommend that users hand-code a couple of articles to test the effectiveness of this context adjustment.
If I may pitch in, I've used the dictionary to analyse a corpus of domestic news in China. In those cases, I create a copy of the dictionary and manually remove some words, or add "Inner Mongolia" to CN. There's no other way to avoid false positives for MN. It is true that "Mongolia" (in any language) generates false positives in a corpus of Chinese domestic news... but then it would need to be removed from all languages, not just the ZH dictionary.
I quite like @chainsawriot's idea of a parameter/function that switches off terms in the dictionary.
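For illustration, the "negative dictionary" idea could look roughly like this. It is only a sketch: `switch_off_words()` is a hypothetical helper, not an existing newsmap or quanteda function, and it assumes a flat named list of seed words rather than the nested structure of the bundled dictionaries.

```r
# Hypothetical helper (not part of newsmap or quanteda): remove "switched off"
# seed words from a flat named list, category by category.
switch_off_words <- function(seeds, negative) {
  # Only touch categories that exist in both lists
  for (key in intersect(names(negative), names(seeds))) {
    seeds[[key]] <- setdiff(seeds[[key]], negative[[key]])
  }
  seeds
}

# Seed words quoted earlier in this thread
seeds <- list(
  CF = c("中非共和国", "中非*", "班吉"),
  MN = c("蒙古*", "乌兰巴托")
)

# Context adjustment for a corpus of Chinese domestic news
negative <- list(CF = "中非*", MN = "蒙古*")

adjusted <- switch_off_words(seeds, negative)
adjusted$CF  # "中非共和国" "班吉"
adjusted$MN  # "乌兰巴托"
```

The adjusted list can then be passed to quanteda's `dictionary()`. A real implementation would probably need to recurse through the nested levels of the newsmap dictionaries.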
Thanks for considering this!
As an entry-level user, I would find it very helpful to have additional instructions, or a function, for adding and deleting country seed words, on the Newsmap tutorial page or somewhere else.
@aseiiss would you tell me which expressions for the Inner Mongolia Autonomous Region '蒙古*' matches?
@koheiw I am not @aseiiss, but the expressions are 内蒙古, 内蒙, or 内蒙古自治区. With quanteda's tokenizer, 内蒙古自治区 will be tokenized into "内" "蒙古" "自治" "区". Of course, one way to fix this is to fix quanteda's tokenizer instead.
Thanks @chainsawriot. One way to avoid matching "蒙古*" to "内蒙古" is to define the latter in the dictionary for China and set nested_scope = "dictionary". With this setting, the lookup function ignores matches that are nested inside a longer phrase matching another category.
require(quanteda)
toks <- tokens("内蒙古自治区")
toks
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "内" "蒙古" "自治" "区"
dict <- dictionary(list(MN = '蒙古*', CN = '内 蒙古*'))
tokens_lookup(toks, dict)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "CN" "MN"
tokens_lookup(toks, dict, nested_scope = "dictionary")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "CN"
If it is odd to include the autonomous region only for China, we can define a special category to ignore ambiguous words like "内 蒙古*".
dict <- dictionary(list(MN = '蒙古*', CN = '中国*', ambiguous = '内 蒙古*'))
tokens_lookup(toks, dict, nested_scope = "dictionary")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "ambiguous"
The ambiguous category works like a NOT operator in Boolean queries, but it might be nice to have a way to set exclusion rules explicitly, like MN = '蒙古* NOT 内 蒙古*'.
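Until something like that exists, the placeholder category still shows up in the lookup result and has to be dropped before counting countries. A minimal sketch, assuming a quanteda version that supports nested_scope; the tokens are built by hand with as.tokens() so the example does not depend on the tokenizer, and the second document is a made-up example:

```r
library(quanteda)

# Hand-built tokens: text1 mentions Inner Mongolia, text2 (hypothetical) mentions Mongolia
toks <- as.tokens(list(
  text1 = c("内", "蒙古", "自治", "区"),
  text2 = c("蒙古", "总统")
))

dict <- dictionary(list(MN = "蒙古*", CN = "中国*", ambiguous = "内 蒙古*"))

# The longer phrase "内 蒙古*" takes precedence over the nested "蒙古*"
found <- tokens_lookup(toks, dict, nested_scope = "dictionary")

# Discard the placeholder category so only country codes remain
countries <- tokens_remove(found, pattern = "ambiguous")
ntoken(countries)
```

Here text1 should end up with no country code and text2 with a single MN.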