
POS tagging

Open loaga opened this issue 4 years ago • 3 comments

I've tried the following example as input:

      這些語辭都含有高調音

這些(Neqa) 語辭(Na) 都(D) 含有(VJ) 高(VH) 調音(VA)
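The tagged line above uses a `word(TAG)` layout. As a minimal sketch, assuming that layout holds for every token and that tags contain only ASCII letters, digits, or underscores, it can be split into (word, tag) pairs with the standard library. `parse_tagged` is a hypothetical helper for illustration, not part of ckiptagger.

```python
import re

# Matches one "word(TAG)" token, e.g. "語辭(Na)".
# Assumption: tags are ASCII letters, digits, or underscores.
TOKEN_RE = re.compile(r"(\S+?)\(([A-Za-z0-9_]+)\)")


def parse_tagged(line):
    """Split 'word(TAG) word(TAG) ...' into a list of (word, tag) pairs."""
    return [(m.group(1), m.group(2)) for m in TOKEN_RE.finditer(line)]


pairs = parse_tagged("這些(Neqa) 語辭(Na) 都(D) 含有(VJ) 高(VH) 調音(VA)")
for word, tag in pairs:
    print(word, tag)
```

In this form it is easy to see that 高調音 was split into 高 (VH) and 調音 (VA) rather than tagged as a single noun.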

With a customized dictionary, the tagger was able to tag 高調音 as Na.

      word_to_weight = {"高調音": 1, "土地公": 1, "土地婆": 1, "公有": 2, "": 1, "來亂的": "啦", "緯來體育台": 1}
      dictionary = construct_dictionary(word_to_weight)  # build the dictionary object passed to ws()
      word_sentence_list = ws(sentence_list, recommend_dictionary=dictionary)
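The dictionary above carries two entries that cannot act as real word weights: an empty key and a non-numeric weight ("啦"). As a minimal sketch, such entries can be filtered out explicitly before the dictionary is built. `clean_word_weights` is a hypothetical helper for illustration, not a ckiptagger function.

```python
def clean_word_weights(word_to_weight):
    """Keep only entries with a non-empty string word and a numeric weight."""
    cleaned = {}
    for word, weight in word_to_weight.items():
        if not isinstance(word, str) or not word:
            continue  # skip empty or non-string keys
        # bool is a subclass of int, so reject it explicitly
        if isinstance(weight, bool) or not isinstance(weight, (int, float)):
            continue  # skip non-numeric weights such as "啦"
        cleaned[word] = float(weight)
    return cleaned


word_to_weight = {
    "高調音": 1,
    "土地公": 1,
    "土地婆": 1,
    "公有": 2,
    "": 1,
    "來亂的": "啦",
    "緯來體育台": 1,
}
cleaned = clean_word_weights(word_to_weight)
print(cleaned)
```

The cleaned mapping can then be passed to `construct_dictionary` and on to `ws(..., recommend_dictionary=...)`, as in the snippet above.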

Is there any code or paper describing how the data files (token_list.npy, vector_list.npy, model_pos, etc.) were trained or created?

Thanks.

loaga avatar Mar 20 '21 08:03 loaga

Both embeddings are trained using the Word2Vec model from gensim.

The corpora are described at https://github.com/ckiplab/ckiptagger/wiki/Corpora.

emfomy avatar Mar 22 '21 02:03 emfomy

Thanks!

On March 21, 2021 at 10:01 PM, Mu Yang wrote:

Both embeddings are trained using the Word2Vec model from gensim.

Here is the detail of the corpus https://github.com/ckiplab/ckiptagger/wiki/Corpora .


loaga avatar Mar 22 '21 09:03 loaga

On https://github.com/ckiplab/ckiptagger/wiki/Chinese-README, I followed the POS tagging link listed for ./data/model_ner/pos_list.txt ("POS tag list, see Wiki / Technical Report no. 93-05").

The report mentions an electronic dictionary that includes each word's part of speech (詞性). How can I get access to it?

Thanks.

loaga avatar Mar 23 '21 00:03 loaga