ACL-2018-Subcharacter Information in Japanese Embeddings: When Is It Worth It?

Open BrambleXu opened this issue 6 years ago • 1 comments

Summary:

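Subcharacter information is known to help for Chinese, so what about Japanese? The paper finds that the gains subcharacter information brings for Chinese do not carry over reliably to Japanese (my guess is that katakana and hiragana are the reason). In settings with a high proportion of kanji, however, character n-grams do improve results. Even so, in the experiments the enhanced skip-gram models still fall short of fastText with single-character n-grams.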

Resource:

  • pdf
  • [code](
  • [paper-with-code](

Paper information:

  • Author:
  • Dataset:
  • keywords:

Notes:

fastText is a subword-level model that can learn character n-grams.
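To make "character n-grams" concrete, here is a minimal sketch of how a word can be split into n-grams. The function name `char_ngrams` and its parameters are assumptions for illustration only; this is not fastText's actual code, which additionally hashes n-grams into buckets and keeps a vector for the whole word.

```python
def char_ngrams(word, min_n=1, max_n=3, boundary=True):
    """Return all character n-grams of `word` with lengths min_n..max_n.

    Hypothetical helper for illustration. With boundary=True the word is
    wrapped in "<" and ">" markers so that prefixes and suffixes are
    distinguishable from word-internal n-grams.
    """
    token = f"<{word}>" if boundary else word
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token.
        for i in range(len(token) - n + 1):
            ngrams.append(token[i:i + n])
    return ngrams


# Single-character n-grams (min_n = max_n = 1) are just the characters.
print(char_ngrams("日本語", min_n=1, max_n=1, boundary=False))

# Bigrams and trigrams with boundary markers.
print(char_ngrams("日本語", min_n=2, max_n=3))
```

Python strings are indexed by code point, so kanji and kana are each treated as one character here.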

  • SG: we modified SG by summing the target word vector w with the vectors of its constituent characters c1 and c2. This can be regarded as a special case of FastText, where the minimum and maximum n-gram sizes are both set to 1.
  • SG+kanji: learn Chinese word embeddings based on characters and sub-characters (Yu et al. 2017, "Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components")
  • SG+kanji+bushu: additionally incorporates bushu (radical) information
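The summation idea in the list above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the lookup tables, the decomposition table `bushu_of`, the function `compose`, and the vector dimension are all hypothetical, the vectors are random, and training is omitted entirely.

```python
import random

DIM = 4  # toy dimensionality, chosen only for readability
random.seed(0)


def new_vector():
    """Random stand-in for a learned embedding."""
    return [random.uniform(-0.5, 0.5) for _ in range(DIM)]


# Hypothetical lookup tables; in a real model these would be learned.
word_vecs = {"休日": new_vector()}
char_vecs = {"休": new_vector(), "日": new_vector()}
bushu_vecs = {"亻": new_vector(), "木": new_vector(), "日": new_vector()}

# Hypothetical kanji-to-bushu decomposition table.
bushu_of = {"休": ["亻", "木"], "日": ["日"]}


def compose(word, use_chars=True, use_bushu=False):
    """Sum the word vector with its character (and optionally bushu) vectors."""
    parts = [word_vecs[word]]
    if use_chars:
        parts += [char_vecs[c] for c in word if c in char_vecs]
    if use_bushu:
        for c in word:
            parts += [bushu_vecs[b] for b in bushu_of.get(c, [])]
    # Element-wise sum over all collected vectors.
    return [sum(values) for values in zip(*parts)]


word_only = compose("休日", use_chars=False)
with_chars = compose("休日")
with_bushu = compose("休日", use_bushu=True)
print(len(word_only), len(with_chars), len(with_bushu))
```

Summing rather than concatenating keeps the composed vector at the same dimensionality as the plain word vector, which is what lets it be dropped into an otherwise unchanged skip-gram objective.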

Model Graph:

Result:

Thoughts:

Next Reading:

BrambleXu, Dec 04 '19 06:12

Is the code open source?

Crescentz, Nov 12 '20 06:11