newsmap

Add more seed dictionaries

Open koheiw opened this issue 6 years ago • 20 comments

More languages need to be covered:

  • [x] English (master)
  • [x] Russian
  • [x] German
  • [x] Spanish
  • [x] Portuguese
  • [x] Italian
  • [x] French
  • [ ] Dutch
  • [x] Chinese (simplified)
  • [x] Chinese (traditional)
  • [ ] Korean
  • [x] Japanese
  • [x] Arabic
  • [x] Turkish
  • [x] Hebrew
  • [ ] Hindi

All localizations should be based on the English master. If you are interested, please see the guideline for translators.

koheiw avatar Feb 27 '18 07:02 koheiw

@danimadrid great job on Spanish!

koheiw avatar Mar 18 '18 07:03 koheiw

Is anyone working on the Arabic dictionary? I would like to help with that.

sneetsher avatar Apr 15 '18 22:04 sneetsher

Hi @sneetsher, no one is working on Arabic yet. That would be awesome!

koheiw avatar Apr 16 '18 16:04 koheiw

@koheiw, just to let you know, I have started the translation, but a few things confuse me.

  1. Are the words in the arrays [...] just an unordered keyword list, so that I can add and remove some? For example, a capital may have the same name as its country, or the capital of one country may be spelled the same as the name of another country (the same letters when the words are written without diacritics, which is generally the case).

  2. Does it support two wildcards? In Arabic, many names can have a prefix and a suffix at the same time. For example:

    الجزائريون
    ال-جزائري-ون
    ال the
    جزائري Algerian
    ون s
    

It would help if you could add some instructions as comments in the English dictionary, so that translators better understand the context and how those words are used in the program.

sneetsher avatar Apr 18 '18 14:04 sneetsher

@sneetsher great to know that you started working on the Arabic dictionary!

The principle is to translate the city and country names in the English master without adding or removing anything, so that all the language versions are comparable. If you think there are missing cities, please open a separate issue so that we can discuss and update all the languages in a coordinated manner.

If there is an unsolvable ambiguity in Arabic, you should consider excluding some of the names. (We have to minimize false positive matches in semi-supervised learning.) I trust your judgement, but please leave a note explaining each removal for future reference. I would also like to understand the problems with Arabic dictionaries.

As for wildcards, you can use multiple *. quanteda is optimized for a wildcard at the end, but it still works with one at the beginning or in the middle. However, handling right-to-left languages is new territory for the package, so it would be good to run some tests. I am more than happy to discuss the challenges of text analysis in right-to-left languages with you.
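
To illustrate the wildcard point, here is a minimal Python sketch of glob-style matching with `*` on both sides of a stem. It uses Python's standard `fnmatch` module as a stand-in, so it only shows the glob semantics and is not quanteda's implementation. The pattern `*جزائري*` is a hypothetical entry built from the example above, not one taken from any dictionary file.

```python
from fnmatch import fnmatchcase

def matches(pattern: str, token: str) -> bool:
    """Return True if the whole token matches the glob pattern."""
    return fnmatchcase(token, pattern)

# Hypothetical pattern: a wildcard on each side of the stem "جزائري" (Algerian).
pattern = "*جزائري*"

# "الجزائريون" = prefix "ال" (the) + stem + suffix "ون" (plural).
both_sides = matches(pattern, "الجزائريون")        # prefix and suffix absorbed by *
bare_stem = matches(pattern, "جزائري")             # * may also match nothing
trailing_only = matches("جزائري*", "الجزائريون")   # a trailing * alone misses the prefix

print(both_sides, bare_stem, trailing_only)
```

Whether the same patterns behave this way on tokenized right-to-left text in the package itself would still need the tests mentioned above.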

Please ask me any questions so that everything is crystal clear. I will then put the answers into instructions for contributors in the Wiki.

koheiw avatar Apr 18 '18 20:04 koheiw

@sneetsher I wrote a guideline on how to translate the English master. I hope it helps.

koheiw avatar Apr 29 '18 15:04 koheiw

Yes, that made many aspects clear, thank you. Sorry I did not reply earlier; I don't have a steady internet connection, and I'm doing a lot of work with Wikipedia (the same workflow you explained) to get the correct spellings.

By the way, I used the same format as the English dictionary, as I understand it: [country, people, capital, very important cities ..]

I didn't want to upload any partial commits, so I'll put it in a GitHub Gist where you can follow it: https://gist.github.com/sneetsher/d5d5e17c09e84109d4c825b22df2207d

sneetsher avatar Apr 29 '18 15:04 sneetsher

Yes, "[country, people, capital, very important cities ..]" is the YAML format. I will write about this in the Guideline.

koheiw avatar Apr 29 '18 16:04 koheiw

The Russian dictionary has been added. Thank you @KT01.

koheiw avatar Jun 13 '18 20:06 koheiw

If I want to create a traditional Chinese dictionary, should I add the words to 'chinese.yaml' or make a distinction between simplified_chinese.yaml and traditional_chinese.yaml?

chainsawriot avatar Jan 31 '19 13:01 chainsawriot

Sounds great! chinese_traditional.yml would be a good file name. I will rename the existing file to chinese_simplified.yml later. Please try to keep them comparable (functionally equivalent). Looking forward to seeing your PR.

koheiw avatar Jan 31 '19 19:01 koheiw

Hi! I think we can create the French dictionary within a reasonable time.

Claude

ClaudeGrasland avatar Feb 08 '19 21:02 ClaudeGrasland

@ClaudeGrasland, amazing! Looking forward to seeing your pull request.

koheiw avatar Feb 08 '19 22:02 koheiw

I am not very familiar with GitHub and YAML... Can you tell me how I can edit the English dictionary and replace its entries with French words? Thank you in advance! Claude

ClaudeGrasland avatar Feb 12 '19 08:02 ClaudeGrasland

A YAML file is a plain text file. Please download the English master and open it in a text editor.

koheiw avatar Feb 12 '19 08:02 koheiw

I discovered two issues:

  1. In the French dictionary, it is better to remove "Hollande" as a keyword for the Netherlands, because it is confused with the former French president François Hollande. Applying the dictionary to French newspapers produces a very large number of false positives for the Netherlands.

  2. In the Japanese dictionary, I noticed an unexpectedly high number of news items about Thailand when testing on the newspaper Asahi Shimbun from 2013 to 2019. According to Kohei, this probably does not reflect real media coverage but an ambiguity caused by the wildcard added to the name (タイ*). When the wildcard is removed (タイ), the results seem more consistent with empirical knowledge of the real distribution of country salience in international news.
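
The effect of the trailing wildcard in タイ* can be reproduced with a small sketch. The token list is a made-up sample of common Japanese loanwords that happen to start with タイ, and Python's `fnmatch` stands in for the dictionary's glob matching, so this only illustrates the likely mechanism and is not something verified on the Asahi Shimbun corpus.

```python
from fnmatch import fnmatchcase

# Made-up sample tokens (assumption, not taken from the corpus):
# タイ = Thailand, タイトル = title, タイミング = timing, タイプ = type, タイヤ = tyre
tokens = ["タイ", "タイトル", "タイミング", "タイプ", "タイヤ"]

def hits(pattern, tokens):
    """Return the tokens matched by a glob pattern."""
    return [t for t in tokens if fnmatchcase(t, pattern)]

with_wildcard = hits("タイ*", tokens)    # every token starting with タイ
without_wildcard = hits("タイ", tokens)  # exact match only

print(len(with_wildcard), len(without_wildcard))
```

In this sample the wildcard version picks up all the unrelated loanwords, while the exact pattern keeps only the country name.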

ClaudeGrasland avatar Apr 19 '19 07:04 ClaudeGrasland

P.S. As I cannot read Japanese, I am not able to solve the issue with Thailand, but I can send a sample of news items to help check the origin of the false positives.

ClaudeGrasland avatar Apr 19 '19 07:04 ClaudeGrasland

Created a separate issue #28

koheiw avatar Apr 19 '19 09:04 koheiw

I will be working on the Hebrew translation

eladseg avatar Jul 24 '19 14:07 eladseg

Hello, I found some issues in the simplified Chinese dictionary. I list them here.

  1. 'CF': [中非共和国, 中非*, 班吉]. The term 中非 is generally used for Sino-African relations rather than specifically for the Central African Republic. The current version captures far too many CF matches because of this. I think it is better to omit '中非*'.
  2. 'MN': [蒙古*, 乌兰巴托]. '蒙古*' would capture the Inner Mongolia Autonomous Region when the user analyzes domestic newspapers. The current version captures far too many MN matches because of this. I believe it is better to use '蒙古国*' instead of '蒙古*'.

I am a beginner on GitHub, so I am just posting it here. Thanks.
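
A quick way to check proposals like these is to count how many tokens each version of an entry captures. The sketch below is only illustrative: the tokens are a made-up sample, how real text would be split into tokens is an assumption, and Python's `fnmatch` stands in for the package's glob matching. The `proposed` entries simply encode the two suggestions above.

```python
from fnmatch import fnmatchcase

# Entries as quoted above, and the proposed revisions.
current = {"CF": ["中非共和国", "中非*", "班吉"], "MN": ["蒙古*", "乌兰巴托"]}
proposed = {"CF": ["中非共和国", "班吉"], "MN": ["蒙古国*", "乌兰巴托"]}

# Made-up sample tokens (assumption): 中非合作 = China-Africa cooperation,
# 中非关系 = China-Africa relations, 蒙古族 = ethnic Mongols, 蒙古国 = Mongolia (the state).
tokens = ["中非合作", "中非关系", "中非共和国", "蒙古族", "蒙古国", "乌兰巴托"]

def count_hits(dictionary, tokens):
    """Count the tokens matching at least one pattern of each country code."""
    return {
        code: sum(any(fnmatchcase(t, p) for p in patterns) for t in tokens)
        for code, patterns in dictionary.items()
    }

print(count_hits(current, tokens))
print(count_hits(proposed, tokens))
```

Running the same comparison on a real tokenized corpus would show how many matches each change removes, and a manual look at the dropped tokens would confirm whether they were false positives.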

aseiiss avatar Jul 20 '22 05:07 aseiiss