
Kanji decomposition (外字注記) bias in the Aozora list

Open scriptin opened this issue 3 years ago • 0 comments

It turns out that Aozora replaces some kanji with images and provides a decomposition in the alt attribute (see 外字注記). Because the dataset was generated by processing HTML files as plain text, many radicals were mistakenly counted as actually appearing in the texts.

From http://vtrm.net/japanese/kanji-frequency/en:

Some kanji radicals or elements which are usually not used on their own gathered relatively high rankings. One would expect such elements not to occur at all, or nearly so. For example, in Shpika’s list, 廴, a radical not used on its own, is stated to occur 1595 times and is ranked 2294th most common kanji. The explanation is simple: when a kanji outside the JIS X 0208 set appears in a text, the Aozora Bunko policy is to break it out into simpler parts. By instance, 𢌞 (it may not be displayed correctly if you don’t have a suitable font installed) is written ※[#「廴+囘」、第4水準2-12-11], where 廴+囘 is the kanji decomposition and 第4水準2-12-11 is the JIS X 0213 code point.
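One possible way to avoid counting these annotations in plain-text sources is to strip them before tallying. The following is only a rough sketch, not what the dataset generator does: the names `GAIJI_NOTE`, `KANJI` and `count_kanji` are made up for illustration, the pattern assumes annotations always have the shape ※[#...] (ASCII or full-width brackets), and the kanji test covers only the basic CJK Unified Ideographs block.

```python
import re
from collections import Counter

# Matches annotations such as ※[#「廴+囘」、第4水準2-12-11].
# Both ASCII and full-width brackets/hash are accepted (an assumption
# about how the annotation may appear in the source files).
GAIJI_NOTE = re.compile(r"※[\[［][#＃][^\]］]*[\]］]")

# Rough kanji test: basic CJK Unified Ideographs block only (assumption).
KANJI = re.compile(r"[\u4e00-\u9fff]")


def count_kanji(text: str) -> Counter:
    """Count kanji in text after removing gaiji annotations."""
    cleaned = GAIJI_NOTE.sub("", text)
    return Counter(KANJI.findall(cleaned))


if __name__ == "__main__":
    sample = "道を※[#「廴+囘」、第4水準2-12-11]る"
    # The components inside the annotation (廴, 囘, 第, 水, 準) are not counted.
    print(count_kanji(sample))
```

Dropping the annotation entirely means the replaced character itself is not counted either; mapping the JIS X 0213 code in the annotation back to the real character would be a separate step.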

Example (from 蜘蛛の糸):

<img src="../../../gaiji/1-87/1-87-71.png" alt="※(「特のへん+廴+聿」、第3水準1-87-71)" class="gaiji">
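For the HTML files, one option is to parse the markup and keep only text nodes, so that attribute values such as alt never reach the counter. This is a minimal sketch using Python's standard `html.parser`; the class `TextOnly` and the function `visible_text` are hypothetical names, and the sketch does not skip script or style content, which a real pipeline would probably want to handle.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes only; attribute values (e.g. alt) are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.gaiji_images = 0  # number of <img class="gaiji"> seen

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "img" and "gaiji" in classes:
            self.gaiji_images += 1

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html: str):
    """Return (text without markup, count of gaiji images)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks), parser.gaiji_images


if __name__ == "__main__":
    html = ('蜘蛛<img src="../../../gaiji/1-87/1-87-71.png" '
            'alt="※(「特のへん+廴+聿」、第3水準1-87-71)" class="gaiji">の糸')
    print(visible_text(html))
```

Counting the gaiji images separately gives a quick estimate of how many characters per file are affected.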

Lists of replaced characters:

  • https://www.aozora.gr.jp/gaiji_chuki/a.html
  • https://www.aozora.gr.jp/gaiji_chuki/ka.html
  • https://www.aozora.gr.jp/gaiji_chuki/sa.html
  • https://www.aozora.gr.jp/gaiji_chuki/ta.html
  • https://www.aozora.gr.jp/gaiji_chuki/na.html
  • https://www.aozora.gr.jp/gaiji_chuki/ha.html
  • https://www.aozora.gr.jp/gaiji_chuki/ma.html
  • https://www.aozora.gr.jp/gaiji_chuki/ya.html
  • https://www.aozora.gr.jp/gaiji_chuki/ra.html
  • https://www.aozora.gr.jp/gaiji_chuki/sonota.html

scriptin, Nov 07 '20 18:11