newsmap Add more seed dictionaries

More languages still need to be covered:

[x] English (master)
[x] Russian
[x] German
[x] Spanish
[x] Portuguese
[x] Italian
[x] French
[ ] Dutch
[x] Chinese (simplified)
[x] Chinese (traditional)
[ ] Korean
[x] Japanese
[x] Arabic
[x] Turkish
[x] Hebrew
[ ] Hindi

All localizations should be based on the English master. If you are interested, please see the guideline for translators.

Feb 27 '18 07:02 koheiw

@danimadrid great job on Spanish!

Mar 18 '18 07:03 koheiw

Is anyone working on the Arabic dictionary? I would like to help with that.

Apr 15 '18 22:04 sneetsher

Hi @sneetsher There is no one working on Arabic. That would be awesome!

Apr 16 '18 16:04 koheiw

@koheiw, just to let you know, I have started the translation, but a few things confuse me.

Are the words in the arrays [...] just an unordered keyword list, so that I can add and remove some? For example, a capital may have the same name as its country, or the capital of one country may be spelled like the name of another country (the same letters when the words are written without diacritics, which is generally the case).
Does it support two wildcards? In Arabic, many names can have a prefix and a suffix at the same time. For example:
```
الجزائريون
ال-جزائري-ون
ال the
جزائري Algerian
ون s
```

It would help if you could add some instructions as comments in the English dictionary, so that translators better understand the context and how these words are used in the program.

Apr 18 '18 14:04 sneetsher

@sneetsher great to know that you started working on the Arabic dictionary!

The principle is to translate the city and country names in the English master without adding or removing anything, so that all the language versions are comparable. If you think cities are missing, please open a separate issue so that we can discuss it and update all the languages in a coordinated manner.
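
One way to check that a translation stays comparable is to confirm that it has exactly the same country codes as the English master. The sketch below is in Python rather than R and uses small made-up excerpts; the function name `compare_keys` and the sample entries are hypothetical and not part of newsmap.

```python
def compare_keys(master, translation):
    """Return country codes missing from, and extra in, a translation."""
    master_codes = set(master)
    translated_codes = set(translation)
    missing = sorted(master_codes - translated_codes)
    extra = sorted(translated_codes - master_codes)
    return missing, extra

# Hypothetical excerpts (country code -> seed words), not the real dictionaries
english = {
    "DZ": ["algeria", "algerian*", "algiers"],
    "EG": ["egypt", "egyptian*", "cairo"],
}
arabic = {
    "DZ": ["الجزائر", "جزائري*"],
}

missing, extra = compare_keys(english, arabic)
print("missing:", missing)
print("extra:", extra)
```

A check like this only compares the codes, so words deliberately dropped inside an entry still need a note from the translator.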

If there is an unsolvable ambiguity in Arabic, you should consider excluding some of the names. (We have to minimize false positive matches in semi-supervised learning.) I trust your judgement, but please leave a note explaining each removal for future reference. I would also like to understand the problems specific to Arabic dictionaries.

As for wildcards, you can use multiple *. quanteda is optimized for a wildcard at the end, but it still works with one at the beginning or in the middle. However, handling right-to-left languages is new territory for the package, so it would be good to run some tests. I am more than happy to discuss the challenges of text analysis in right-to-left languages with you.
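
As a rough illustration of what a second wildcard buys for the Arabic example above, here is a minimal Python sketch. It uses the standard library's `fnmatch` as a stand-in for glob matching; it is not quanteda's matcher, and the helper name `matches` is hypothetical.

```python
from fnmatch import fnmatchcase

def matches(token, pattern):
    """Case-sensitive glob match; '*' stands for any run of characters."""
    return fnmatchcase(token, pattern)

# "the Algerians": prefix ال + stem جزائري + suffix ون
token = "الجزائريون"

# Trailing wildcard only: the prefix ال prevents a match.
trailing_only = matches(token, "جزائري*")
# Wildcards on both sides: prefix and suffix are both absorbed.
both_sides = matches(token, "*جزائري*")

print(trailing_only, both_sides)
```

Whether the leading wildcard is worth its cost in speed and in extra false positives is something to test on real Arabic text.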

Please ask me any questions so that everything is crystal clear. I will then put the answers into instructions for contributors in the Wiki.

Apr 18 '18 20:04 koheiw

@sneetsher I wrote a guideline on how to translate the English master. I hope it helps.

Apr 29 '18 15:04 koheiw

Yes, that made many aspects clear, thank you. Sorry I didn't reply earlier; I don't have a steady internet connection, and I am doing a lot of work with Wikipedia (the same workflow you explained) to get the correct spellings.

By the way, I used the same format as the English dictionary, as I understand it: [country, people, capital, very important cities ..]

I didn't want to upload any partial commits, so I'll put it in a GitHub Gist where you can follow it (here it is: https://gist.github.com/sneetsher/d5d5e17c09e84109d4c825b22df2207d)

Apr 29 '18 15:04 sneetsher

Yes, "[country, people, capital, very important cities ..]" is the YAML format. I will write about this in the Guideline.

Apr 29 '18 16:04 koheiw

The Russian dictionary has been added. Thank you @KT01.

Jun 13 '18 20:06 koheiw

If I want to create a traditional Chinese dictionary, should I add the words to 'chinese.yaml', or should there be separate files, simplified_chinese.yaml and traditional_chinese.yaml?

Jan 31 '19 13:01 chainsawriot

Sounds great! chinese_traditional.yml would be good as the file name. I will rename the existing file to chinese_simplified.yml later. Please try to keep them comparable (functionally equivalent). Looking forward to seeing your PR.

Jan 31 '19 19:01 koheiw

Hi! I think we can create the French dictionary within a reasonable time.

Claude

Feb 08 '19 21:02 ClaudeGrasland

@ClaudeGrasland, amazing! Looking forward to seeing your pull request.

Feb 08 '19 22:02 koheiw

I am not very familiar with GitHub and YAML... Can you tell me how I can edit the English dictionary and replace its entries with French words? Thank you in advance! Claude

Feb 12 '19 08:02 ClaudeGrasland

A YAML file is a plain text file. Please download the English master and just open it in a text editor.

Feb 12 '19 08:02 koheiw

I discovered two issues:

In the French dictionary, it is better to remove "Hollande" as a keyword for the Netherlands, because it is confused with the former French president François Hollande. Applying the dictionary to French newspapers produces a dramatic number of false positives for the Netherlands.
In the Japanese dictionary, I noticed an unexpectedly large number of news items about Thailand when testing on the newspaper Asahi Shimbun from 2013 to 2019. According to Kohei, this probably does not reflect real media coverage but an ambiguity caused by the wildcard added to the name (タイ*). When the wildcard is removed (タイ), the results seem more consistent with empirical knowledge of the real distribution of countries' salience in international news.
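
To make the second issue concrete, the Python sketch below compares a trailing-wildcard pattern with an exact one on a few Japanese tokens. The token list is invented for illustration (タイトル "title", タイミング "timing", タイプ "type", タイ王国 "Kingdom of Thailand"), `fnmatch` is only a stand-in for the package's glob matching, and the helper `matched` is hypothetical, so this shows how such false positives could arise rather than what actually happened in the Asahi Shimbun data.

```python
from fnmatch import fnmatchcase

# Hypothetical tokens; a real tokenizer may segment the text differently
tokens = ["タイ", "タイトル", "タイミング", "タイプ", "タイ王国"]

def matched(tokens, pattern):
    """Return the tokens that a glob pattern matches."""
    return [t for t in tokens if fnmatchcase(t, pattern)]

with_wildcard = matched(tokens, "タイ*")  # every token above starts with タイ
exact_only = matched(tokens, "タイ")      # only the bare country name

print(with_wildcard)
print(exact_only)
```

Note the trade-off: the exact pattern also stops matching タイ王国, so longer forms of the name would need their own entries.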

Apr 19 '19 07:04 ClaudeGrasland

P.S. As I cannot read Japanese, I am not able to solve the issue with Thailand, but I can send a sample of news items to help check the origin of the false positives.

Apr 19 '19 07:04 ClaudeGrasland

Created a separate issue #28

Apr 19 '19 09:04 koheiw

I will be working on the Hebrew translation

Jul 24 '19 14:07 eladseg

Hello, I found some issues in the simplified Chinese dictionary. I will just list them here.

'CF': [中非共和国, 中非*, 班吉]. The term 中非 is used in the general context of Sino-African relations rather than specifically for the Central African Republic. The current version captures far too many CF matches because of this. I think it is better to omit '中非*'.
'MN': [蒙古*, 乌兰巴托]. '蒙古*' would capture the Inner Mongolia Autonomous Region when the user analyzes domestic newspapers. The current version captures far too many MN matches because of this. I believe it is better to use '蒙古国*' instead of '蒙古*'. I am a GitHub beginner, so I am just posting this here. Thanks.
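
A quick way to preview a proposed narrowing such as '蒙古*' to '蒙古国*' is to list the tokens that the old pattern matched and the new one no longer does. This Python sketch assumes glob semantics via `fnmatch` and an invented token list (蒙古族 "Mongol ethnic group", 蒙古包 "yurt"); the helper `newly_excluded` is hypothetical, and real results depend on how the tokenizer segments the text.

```python
from fnmatch import fnmatchcase

def newly_excluded(tokens, old_pattern, new_pattern):
    """Tokens matched by old_pattern but no longer by new_pattern."""
    return [
        t for t in tokens
        if fnmatchcase(t, old_pattern) and not fnmatchcase(t, new_pattern)
    ]

# Hypothetical tokens from Chinese news text
tokens = ["蒙古国", "蒙古族", "蒙古包", "乌兰巴托"]

dropped = newly_excluded(tokens, "蒙古*", "蒙古国*")
print(dropped)
```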

Jul 20 '22 05:07 aseiiss
