hanzi-tools

Add the source used for this system

Open hugolpz opened this issue 8 years ago • 2 comments

  • [ ] What is the source? Unihan, CJKlib, Moedict, ...
  • [ ] How many characters are covered?

Thanks for this project!

hugolpz, Mar 16 '17 16:03

The dictionary used is CC-CEDICT, plus whatever node-pinyin uses behind the scenes. I'm not sure exactly how many characters are covered; I'll have to investigate this later.

peterolson, Mar 16 '17 19:03

According to node-pinyin's Readme.md#Source:

  • https://code.google.com/archive/p/chinese-character-2-pinyin/
  • possibly other pinyin sources listed there as well (IME)

Strictly speaking, node-pinyin's data is in /tools/dict2.js. After cleanup, there are 24449 character/phonetic pairs, which is close to the UNIHAN data, currently at 25500 entries.

screenshot from 2017-03-17 11-08-36

screenshot from 2017-03-17 11-07-40
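
For anyone wanting to reproduce a count like the one above, here is a rough sketch in Python. It assumes the data has already been exported as a mapping from each character to a comma-separated string of readings; that layout is an assumption for illustration, and the actual structure of dict2.js may differ. The function name `count_pairs` and the three sample entries are hypothetical.

```python
def count_pairs(char_to_readings):
    """Count distinct (character, reading) pairs.

    char_to_readings: dict mapping a character to a comma-separated
    string of readings (assumed format, not necessarily node-pinyin's).
    """
    pairs = set()
    for char, readings in char_to_readings.items():
        for reading in readings.split(","):
            reading = reading.strip()
            if reading:  # skip empty fragments left by stray commas
                pairs.add((char, reading))
    return len(pairs)


# Tiny hand-written sample, not taken from node-pinyin's data.
sample = {"中": "zhōng,zhòng", "好": "hǎo,hào", "人": "rén"}
pair_count = count_pairs(sample)
```

Using a set means a reading listed twice for the same character is only counted once, which is one plausible meaning of "after cleanup".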

node-pinyin's data format doesn't suit linguistic studies though, as several phonetic entries can be paired with the same character without any prioritization (e.g. by frequency). The data therefore fits IME needs but not linguistic needs.

screenshot from 2017-03-17 11-11-28
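
To illustrate what prioritization by frequency could look like, here is a minimal sketch in Python. It assumes a separate frequency table keyed by (character, reading) is available; node-pinyin does not provide one as far as this thread shows, and the counts below are made-up placeholders. The name `primary_readings` is hypothetical.

```python
def primary_readings(readings_by_char, freq):
    """Pick one reading per character, preferring the most frequent.

    readings_by_char: dict mapping a character to a list of readings.
    freq: dict mapping (character, reading) to a count (assumed input).
    Readings missing from freq count as 0; on ties the reading listed
    first is kept, because max() returns the first maximal element.
    """
    primary = {}
    for char, readings in readings_by_char.items():
        if not readings:
            continue  # nothing to choose from
        primary[char] = max(readings, key=lambda r: freq.get((char, r), 0))
    return primary


# Placeholder data for illustration only; the counts are invented.
readings_by_char = {"行": ["háng", "xíng"], "人": ["rén"]}
freq = {("行", "xíng"): 90, ("行", "háng"): 10}
primary = primary_readings(readings_by_char, freq)
```

With a table like this, a linguistic tool could default to the most common reading while still keeping the full list around for IME-style lookup.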

hugolpz, Mar 17 '17 10:03