hanzi-tools Add the source used for this system


Open hugolpz opened this issue 8 years ago • 2 comments

[ ] What is the source? Unihan, CJKlib, Moedict, ...
[ ] How many characters are covered?

Thanks for this project!

Mar 16 '17 16:03 hugolpz

The dictionary used is CC-CEDICT, plus whatever node-pinyin uses behind the scenes. I'm not sure exactly how many characters are covered; I'll have to investigate this later.

Mar 16 '17 19:03 peterolson

According to node-pinyin's Readme.md#Source, the data comes from:

https://code.google.com/archive/p/chinese-character-2-pinyin/
and maybe the other pinyin sources listed there as well (IME)

Strictly speaking, node-pinyin's data is in /tools/dict2.js. After cleanup, there are 24449 character/phonetic pairs, which is pretty close to the UNIHAN data, currently at 25500 entries.


node-pinyin's data format doesn't suit linguistic studies, though, as there can be several phonetic entries paired with the same character. Without prioritization (e.g. by frequency), it fits IME needs but not linguistic needs.
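
To illustrate the point about several readings per character, here is a minimal Python sketch that groups character/pinyin pairs by character and lists the characters with more than one reading. The sample pairs and the `group_readings` helper are hypothetical, made up for illustration; they are not taken from /tools/dict2.js, whose actual format may differ.

```python
from collections import defaultdict

# Hypothetical sample of (character, pinyin) pairs; NOT read from dict2.js.
pairs = [
    ("行", "xíng"),
    ("行", "háng"),
    ("好", "hǎo"),
    ("好", "hào"),
    ("我", "wǒ"),
]


def group_readings(pairs):
    """Map each character to its readings, in first-seen order, without duplicates."""
    readings = defaultdict(list)
    for char, pinyin in pairs:
        if pinyin not in readings[char]:
            readings[char].append(pinyin)
    return dict(readings)


readings = group_readings(pairs)

# Characters for which the data alone cannot tell us which reading to prefer.
ambiguous = {char: r for char, r in readings.items() if len(r) > 1}

print(len(pairs), "pairs,", len(readings), "distinct characters")
print("characters with several readings:", ambiguous)
```

The number of pairs and the number of distinct characters are different figures, so a count of pairs is not directly a count of characters covered. Choosing one reading per character would need extra information, such as frequency data, which this sketch does not assume.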


Mar 17 '17 10:03 hugolpz
