
Classical Chinese Model needed

Open · KoichiYasuoka opened this issue 5 years ago · 31 comments

I've almost finished building the UD_Classical_Chinese-Kyoto Treebank, and now I'm trying to make a Classical Chinese model for NLP-Cube (please check my diary). But my model's sentence_accuracy is <35, and I can't sentencize "天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠" (check the gold standard here). How do I tune up sentencization for Classical Chinese?

KoichiYasuoka avatar May 07 '19 04:05 KoichiYasuoka

I looked over the corpus, and I see there are no delimiters (punctuation marks) for sentences. Is this OK?

tiberiu44 avatar May 07 '19 05:05 tiberiu44

Yes, OK. Classical Chinese does not have any punctuation or spaces between words or sentences. Therefore, in my humble opinion, tokenization is a hard task without POS-tagging, and sentencization is a hard task without dependency parsing...

KoichiYasuoka avatar May 07 '19 06:05 KoichiYasuoka

I think we could go for joint POS-tagging and tokenization. Unfortunately, the algorithm we use for dependency parsing requires us to build an N×N matrix over all N words, which is likely to cause an out-of-memory error if we use all tokens. Do you know of any other approach that does not require dependency parsing for sentence segmentation?
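
For a rough sense of the scaling, here is a back-of-the-envelope sketch (float32 scores and the label count are illustrative assumptions, not NLP-Cube's actual configuration):

# Memory for a biaffine-style N x N score tensor grows quadratically in N.
def score_tensor_bytes(n_tokens, n_labels=50, bytes_per_float=4):
    # one N x N arc-score matrix plus an N x N x n_labels label-score tensor
    return n_tokens * n_tokens * (1 + n_labels) * bytes_per_float

for n in (100, 1000, 10000):
    print(n, "tokens:", round(score_tensor_bytes(n) / 2**30, 2), "GiB")
# 10000 untokenized characters already needs ~19 GiB just for the scores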

tiberiu44 avatar May 07 '19 07:05 tiberiu44

Umm... I only know the Straka & Straková (2017) approach using dynamic programming (see section 4.3), but it requires tentative parse trees...

KoichiYasuoka avatar May 07 '19 07:05 KoichiYasuoka

I see. I can imagine joint sentence segmentation and parsing working by using the arc system: whenever the stack is emptied, a sentence boundary should be generated.
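
A minimal sketch of that idea, assuming a hypothetical oracle that predicts transition actions (the action names, including "ROOT", are illustrative, not NLP-Cube code):

# Joint parsing + sentence splitting with an arc-standard-style system.
def parse_and_split(tokens, oracle):
    stack, buf, arcs, boundaries = [], list(enumerate(tokens)), [], []
    while buf or stack:
        action = oracle(stack, buf)
        if action == "SHIFT":
            stack.append(buf.pop(0))
        elif action == "LEFT-ARC":    # stack[-2] becomes a dependent of stack[-1]
            arcs.append((stack[-1][0], stack.pop(-2)[0]))
        elif action == "RIGHT-ARC":   # stack[-1] becomes a dependent of stack[-2]
            arcs.append((stack[-2][0], stack.pop()[0]))
        elif action == "ROOT":        # popping the last word empties the stack,
            arcs.append((-1, stack.pop()[0]))
            boundaries.append(len(tokens) - len(buf))  # ...so emit a boundary here
    return arcs, boundaries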

We've finished work on the parser and tagger for version 2.0, but we still haven't found a good solution for tokenization/sentence splitting.

I think I will give this new approach a try, but it will take some time to implement. I'll let you know when it's done and maybe you can test it on your corpus.

Thanks for the feedback, Tibi

tiberiu44 avatar May 07 '19 07:05 tiberiu44

@KoichiYasuoka - I haven't had any success with the tokenizer/sentence splitter so far. We are working on rolling out version 2.0, which uses a single model conditionally trained with language embeddings. We have great accuracy figures for the parser and tagger. However, we are still experiencing difficulties with the tokenizer (for all languages).

We tried joint tagging/parsing and tokenizing, but we simply got the same results as when doing the two tasks independently. Any suggestions on how to proceed?

tiberiu44 avatar Oct 31 '19 08:10 tiberiu44

Umm... For Japanese tokenization (word splitting) and POS-tagging, we often apply Conditional Random Fields, as in Kudo et al. (2004). For Classical Chinese, we also use a CRF in our UD-Kanbun.
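
For illustration, joint tokenization and POS-tagging can be cast as character-level sequence labeling with composite BIES+POS tags. A minimal sketch with sklearn-crfsuite (the feature template and toy training set are assumptions, not UD-Kanbun's actual setup):

# Character-level CRF: each character gets a tag like "B-NOUN" that encodes
# both a token-boundary label (B/M/E/S) and the POS of the containing token.
import sklearn_crfsuite  # pip install sklearn-crfsuite

def char_features(sent, i):
    f = {"c0": sent[i]}                  # current character
    if i > 0:
        f["c-1"] = sent[i - 1]           # previous character
        f["bi"] = sent[i - 1] + sent[i]  # character bigram
    if i < len(sent) - 1:
        f["c+1"] = sent[i + 1]           # next character
    return f

s = "不亦君子乎"
X = [[char_features(s, i) for i in range(len(s))]]
y = [["S-ADV", "S-ADV", "B-NOUN", "E-NOUN", "S-PART"]]  # 君子 is one two-character token

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))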

For sentence segmentation in Classical Chinese, recent progress has been made by Hu et al. (2019) at https://seg.shenshen.wiki/. Hu et al. use a BERT model trained on an enormous corpus of Classical Chinese texts, 3.3×10⁹ characters...

KoichiYasuoka avatar Oct 31 '19 13:10 KoichiYasuoka

@KoichiYasuoka - I hope you are doing well in this time of crisis.

It's been a long time since our last progress update on this issue. We started training the 2.0 models for NLP-Cube and they should be out soon. I saw the Classical Chinese corpus in the UD Treebanks (v2.5). The model will be included in this release. Congratulations and thank you for your work.

I thought you might be interested to know that we are also setting up a "model zoo" for NLP-Cube, so contributors can publish their pre-trained models. We will try to make research attribution easy by printing a banner with copyright and/or citation options for these models.

tiberiu44 avatar May 01 '20 07:05 tiberiu44

@tiberiu44 - Thank you for using our UD_Classical_Chinese-Kyoto for your NLP-Cube. We've just finished adding 19 more volumes from the "禮記" (Book of Rites) to https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto/tree/dev for the v2.6 release of UD Treebanks (scheduled for May 15, 2020). Enjoy!

KoichiYasuoka avatar May 01 '20 10:05 KoichiYasuoka

Hi @KoichiYasuoka ,

We've finished releasing the current version of NLP-Cube, and we included the Classical Chinese model from UD 2.7. Sentence segmentation seems to be problematic for this treebank. You can check branch 3.0 of the repo for more info: https://github.com/adobe/NLP-Cube/tree/3.0

If you have any suggestions regarding sentence segmentation, please let me know. Right now we are using xlm-roberta-base for language modeling, but maybe there is some other LM that can provide better results.

Best, Tiberiu

tiberiu44 avatar Aug 12 '21 06:08 tiberiu44

Thank you @tiberiu44 for releasing NLP-Cube 3.0. But, well, pytorch-lightning==1.1.7 is too old for the recent torchtext==0.10.0, so I use pytorch-lightning==1.2.10 instead:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("不入虎穴不得虎子")
>>> print(doc)
1	不入虎穴不得虎子	叔津	PROPN	n,名詞,人,複合的人名	NameType=Prs	0	root	_	_

Umm... tokenization of classical Chinese doesn't work here...

KoichiYasuoka avatar Aug 12 '21 10:08 KoichiYasuoka

Yes, I see something is definitely wrong with the model. I just tried your example and tokenization did not work. However, on longer examples it seems to behave differently:

Out[13]:
1	子曰學而時習之不亦說乎	子春城	PROPN	n,名詞,人,名	NameType=Giv	2	nsubj	_	_
2	有	有	VERB	v,動詞,存在,存在	_	0	root	_	_
3	朋	朋	NOUN	n,名詞,人,関係	_	2	obj	_	_
4	自	自	ADP	v,前置詞,経由,*	_	6	case	_	_
5	遠	遠	VERB	v,動詞,描写,量	Degree=Pos|VerbForm=Part	6	amod	_	_
6	方	方	NOUN	n,名詞,固定物,関係	Case=Loc	7	obl	_	_
7	來	來	VERB	v,動詞,行為,移動	_	2	ccomp	_	_
8	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	14	advmod	_	_
9	亦	亦	ADV	v,副詞,頻度,重複	_	10	advmod	_	_
10	樂	樂	VERB	v,動詞,行為,態度	_	2	conj	_	_
11	乎	乎	ADP	v,前置詞,基盤,*	_	12	case	_	_
12	人	人	NOUN	n,名詞,人,人	_	7	obl	_	_
13	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	14	advmod	_	_
14	知	知	VERB	v,動詞,行為,動作	_	10	parataxis	_	_

1	而	而	CCONJ	p,助詞,接続,並列	_	3	advmod	_	_
2	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	3	advmod	_	_
3	慍	慍	VERB	v,動詞,行為,態度	_	6	csubj	_	_
4	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	6	advmod	_	_
5	亦	亦	ADV	v,副詞,頻度,重複	_	6	advmod	_	_
6	君子	君子	NOUN	n,名詞,人,役割	_	0	root	_	_
7	乎	乎	PART	p,助詞,句末,*	_	6	discourse:sp	_	_

I will try retraining the tokenizer with a different LM.

tiberiu44 avatar Aug 12 '21 11:08 tiberiu44

Umm... the first eleven characters seem untokenized:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("子曰道千乘之國敬事而信節用而愛人使民以時")
>>> print(doc)
1	子曰道千乘之國敬事而信	子春于	PROPN	n,名詞,人,名	NameType=Giv	2	nsubj	_	_
2	節	節	VERB	v,動詞,描写,態度	Degree=Pos	0	root	_	_
3	用	用	VERB	v,動詞,行為,動作	_	2	flat:vv	_	_

1	而	而	CCONJ	p,助詞,接続,並列	_	2	advmod	_	_
2	愛	愛	VERB	v,動詞,行為,交流	_	6	csubj	_	_
3	人	人	NOUN	n,名詞,人,人	_	2	obj	_	_
4	使	使	VERB	v,動詞,行為,使役	_	2	parataxis	_	_
5	民	民	NOUN	n,名詞,人,人	_	4	obj	_	_
6	以	以	VERB	v,動詞,行為,動作	_	0	root	_	_
7	時	時	NOUN	n,名詞,時,*	Case=Tem	6	obj	_	_

KoichiYasuoka avatar Aug 12 '21 11:08 KoichiYasuoka

Yes, this seems to be a recurring issue with any text I try. I'm retraining the tokenizer/sentence splitter right now (it will take a couple of hours). Hopefully, this will solve the problem. I'll let you know as soon as I publish the new model.

tiberiu44 avatar Aug 12 '21 12:08 tiberiu44

Thank you @tiberiu44, and I will wait for the new tokenizer. Ah, well, for sentence segmentation of Classical Chinese, I released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-char and https://github.com/KoichiYasuoka/SuPar-Kanbun, using the segmentation algorithm of 一种基于循环神经网络的古文断句方法 (a recurrent-neural-network approach to ancient Chinese sentence segmentation). I hope these help you.

KoichiYasuoka avatar Aug 12 '21 12:08 KoichiYasuoka

This is perfect. I will use your model to train the Classical Chinese pipeline:

python3 cube/trainer.py --task=tokenizer --train=scripts/train/2.7/language/lzh.yaml --store=data/lzh-trf-tokenizer --num-workers=0 --lm-device=cuda:0 --gpus=1 --lm-model=transformer:KoichiYasuoka/roberta-classical-chinese-large-char

Given that this is a dedicated model, I hope it will provide better results than any other LM.

Thank you for this.

tiberiu44 avatar Aug 12 '21 12:08 tiberiu44

Thank you @tiberiu44 for releasing nlpcube 0.3.0.7. I tried the new Classical Chinese model with pytorch-lightning==1.2.10 and torchtext==0.10.0:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("不入虎穴不得虎子")
>>> print(doc)
1	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	2	advmod	_	_
2	入	入	VERB	v,動詞,行為,移動	_	0	root	_	_
3	虎	虎	NOUN	n,名詞,主体,動物	_	4	nmod	_	_
4	穴	<UNK>	NOUN	n,名詞,可搬,道具	_	2	obj	_	_

1	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	2	advmod	_	_
2	得	得	VERB	v,動詞,行為,得失	_	0	root	_	_

1	虎	虎	NOUN	n,名詞,主体,動物	_	0	root	_	_

1	子 	子產	PROPN	n,名詞,人,名	NameType=Giv	0	root	_	_;compund

The tokenization seems to work well this time. Now the problem is the sentence segmentation...

KoichiYasuoka avatar Aug 12 '21 23:08 KoichiYasuoka

Thank you for the feedback. I'm working on that right now. Hope to get it fixed soon.

tiberiu44 avatar Aug 13 '21 13:08 tiberiu44

So far, I have only got a sentence F-score of 20 (best result using your RoBERTa model):

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     98.40 |     97.34 |     97.87 |
Sentences  |     34.06 |     15.03 |     20.86 |
Words      |     98.40 |     97.34 |     97.87 |
UPOS       |     92.36 |     91.37 |     91.86 |     93.86
XPOS       |     89.27 |     88.31 |     88.78 |     90.72
UFeats     |     92.95 |     91.95 |     92.45 |     94.46
AllTags    |     87.35 |     86.41 |     86.88 |     88.77
Lemmas     |     92.01 |     91.02 |     91.51 |     93.51
UAS        |     66.76 |     66.04 |     66.40 |     67.84
LAS        |     61.46 |     60.80 |     61.13 |     62.46
CLAS       |     60.49 |     59.19 |     59.83 |     60.96
MLAS       |     56.81 |     55.59 |     56.20 |     57.25
BLEX       |     56.06 |     54.86 |     55.45 |     56.49

The UAS and LAS scores are low because every time it gets a sentence wrong, the system also mislabels the root node.

tiberiu44 avatar Aug 14 '21 06:08 tiberiu44

20.86% is much worse than the result (80%) of 一种基于循环神经网络的古文断句方法. OK, here I tried it myself with transformers on Google Colab:

!pip install 'transformers>=4.7.0' datasets seqeval
!test -d UD_Classical_Chinese-Kyoto || git clone https://github.com/universaldependencies/UD_Classical_Chinese-Kyoto
!test -f run_ner.py || curl -LO https://raw.githubusercontent.com/huggingface/transformers/v`pip list | sed -n 's/^transformers *\([^ ]*\) *$/\1/p'`/examples/pytorch/token-classification/run_ner.py

# Convert each CoNLL-U file into character-level JSON for run_ner.py,
# labeling every character by its position in the gold sentence:
# S = single-character sentence, B = first, M = middle, E = last,
# E2/E3 = second/third character from the end; examples are flushed
# in chunks of more than 80 characters.
for d in ["train","dev","test"]:
  with open("UD_Classical_Chinese-Kyoto/lzh_kyoto-ud-"+d+".conllu","r",encoding="utf-8") as f:
    r=f.read()
  with open(d+".json","w",encoding="utf-8") as f:
    tokens=[]
    tags=[]
    i=0
    for s in r.split("\n"):
      t=s.split("\t")
      if len(t)==10:
        for c in t[1]:
          tokens.append(c)
          i+=1
      else:
        if i==1:
          tags.append("S")
        elif i==2:
          tags+=["B","E"]
        elif i==3:
          tags+=["B","E2","E"]
        elif i>3:
          tags+=["B"]+["M"]*(i-4)+["E3","E2","E"]
        i=0
        if len(tokens)>80:
          print("{\"tokens\":[\""+"\",\"".join(tokens)+"\"],\"tags\":[\""+"\",\"".join(tags)+"\"]}",file=f)
          tokens=[]
          tags=[]

!python run_ner.py --model_name_or_path KoichiYasuoka/roberta-classical-chinese-large-char --train_file train.json --validation_file dev.json --test_file test.json --output_dir my.danku --do_train --do_eval

I got "eval metrics" as follows:

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9212
  eval_f1                 =     0.8995
  eval_loss               =     0.2794
  eval_precision          =     0.8991
  eval_recall             =     0.8998
  eval_runtime            = 0:00:09.70
  eval_samples            =        329
  eval_samples_per_second =     33.901
  eval_steps_per_second   =      4.328

Then I tried to sentencize the paragraph I wrote two years ago (https://github.com/adobe/NLP-Cube/issues/100#issue-441024053):

# Load the fine-tuned segmenter and insert "。" after each character tagged E or S
import torch
from transformers import AutoTokenizer,AutoModelForTokenClassification
tkz=AutoTokenizer.from_pretrained("my.danku")
mdl=AutoModelForTokenClassification.from_pretrained("my.danku")
s="天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠"
e=tkz.encode(s,return_tensors="pt")
p=[mdl.config.id2label[q] for q in torch.argmax(mdl(e)[0],dim=2)[0].tolist()[1:-1]]
print("".join(c+"。" if q=="E" or q=="S" else c for c,q in zip(s,p)))

And I got the result "天平二年正月十三日萃于帥老之宅。申宴會也。于時初春令月。氣淑風和。梅披鏡前之粉。蘭薰珮後之香。加以曙嶺移雲。松掛羅而傾盖。夕岫結霧。鳥封縠而迷林。庭舞新蝶。空歸故鴈。於是盖天坐地。促膝飛觴。忘言一室之裏。開衿煙霞之外。淡然自放。快然自足。若非翰苑何以攄情。詩紀落梅之篇。古今夫何異矣。宜賦園梅。聊成短詠。" How about your system @tiberiu44?

KoichiYasuoka avatar Aug 15 '21 07:08 KoichiYasuoka

Unfortunately, I cannot run the test right now, and I will be away from keyboard most of the day. I will try your approach with transformers tomorrow.

The latest models are pushed if you want to try them. If you already loaded lzh, you will need to trigger a redownload of the model.

The easiest way is to remove all lzh files located in ~/.nlpcube/3.0 (anything that starts with lzh, including a folder).
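
For example, a minimal Python sketch of that cleanup (assuming the default cache location):

# Delete the cached lzh files so NLP-Cube re-downloads the model.
import shutil
from pathlib import Path

cache = Path.home() / ".nlpcube" / "3.0"
for p in cache.glob("lzh*"):
    shutil.rmtree(p) if p.is_dir() else p.unlink()
    print("removed", p)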

tiberiu44 avatar Aug 15 '21 07:08 tiberiu44

Thank you @tiberiu44 for releasing nlpcube 0.3.1.0. I cleaned up my ~/.nlpcube/3.0/lzh:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠")
>>> print("".join(s.text.replace(" ","")+"。" for s in doc.sentences))

And I got the result "天平二年正月十三日萃于帥老之宅申宴會也。于時初春令月氣淑風和。梅披鏡前之粉蘭薰珮後之香。加以曙嶺移雲松掛羅而傾盖。夕岫結霧。鳥封縠而迷林庭舞新蝶空歸故鴈。於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情。詩紀落梅之篇古今夫何異矣。宜賦園梅。聊。成。短詠。" Umm... "聊。成。短詠。" seems meaningless, but the other segmentations are rather good. Then, how do we improve...

KoichiYasuoka avatar Aug 15 '21 08:08 KoichiYasuoka

On your previous example, the current version of the tokenizer generates this sentence segmentation:

1	天平	天平	NOUN	n,名詞,時,*	Case=Tem	3	nmod	_	_
2	二	二	NUM	n,数詞,数字,*	_	3	nummod	_	_
3	年	年	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
4	正	正	NOUN	n,名詞,時,*	_	5	amod	_	_
5	月	月	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
6	十三	十三	NUM	n,数詞,数,*	_	7	nummod	_	_
7	日	日	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
8	萃	<UNK>	VERB	v,動詞,行為,動作	_	0	root	_	_
9	于	于	ADP	v,前置詞,基盤,*	_	13	case	_	_
10	帥	帥	NOUN	n,名詞,人,役割	_	11	amod	_	_
11	老	老	NOUN	n,名詞,人,人	_	13	nmod	_	_
12	之	之	SCONJ	p,助詞,接続,属格	_	11	case	_	_
13	宅	宅	NOUN	n,名詞,固定物,建造物	Case=Loc	8	obl:lmod	_	_
14	申	申	VERB	v,動詞,行為,動作	_	8	parataxis	_	_
15	宴	宴	VERB	v,動詞,行為,交流	VerbForm=Part	14	obj	_	_
16	會	會	VERB	v,動詞,行為,交流	_	15	flat:vv	_	_
17	也	也	PART	p,助詞,句末,*	_	8	discourse:sp	_	_

1	于	于	ADP	v,前置詞,基盤,*	_	2	case	_	_
2	時	時	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
3	初	初	NOUN	n,名詞,時,*	Case=Tem	4	nmod	_	_
4	春	春	NOUN	n,名詞,時,*	Case=Tem	6	nmod	_	_
5	令	令	NOUN	n,名詞,人,役割	_	6	nmod	_	_
6	月	月	NOUN	n,名詞,時,*	Case=Tem	8	nsubj	_	_
7	氣	氣	NOUN	n,名詞,描写,形質	_	8	nsubj	_	_
8	淑	淑	VERB	v,動詞,描写,態度	Degree=Pos	0	root	_	_
9	風	風	NOUN	n,名詞,天象,気象	_	10	nsubj	_	_
10	和	和	VERB	v,動詞,描写,形質	Degree=Pos	8	conj	_	_

1	梅	梅	NOUN	n,名詞,固定物,樹木	_	2	nsubj	_	_
2	披	披	VERB	v,動詞,行為,動作	_	0	root	_	_
3	鏡	<UNK>	NOUN	n,名詞,可搬,道具	_	4	nmod	_	_
4	前	前	NOUN	n,名詞,固定物,関係	Case=Loc	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	粉	<UNK>	NOUN	n,名詞,不可譲,身体	_	2	obj	_	_

1	蘭	蘭	NOUN	n,名詞,可搬,道具	_	2	nsubj	_	_
2	薰	<UNK>	NOUN	n,名詞,可搬,道具	_	0	root	_	_
3	珮	<UNK>	NOUN	n,名詞,可搬,道具	_	4	nmod	_	_
4	後	後	NOUN	n,名詞,固定物,関係	Case=Tem	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	香	香	NOUN	n,名詞,描写,形質	_	2	obj	_	_

1	加	加	VERB	v,動詞,行為,得失	_	5	advmod	_	_
2	以	以	VERB	v,動詞,行為,動作	_	5	advcl	_	_
3	曙	<UNK>	NOUN	n,名詞,描写,形質	_	4	nmod	_	_
4	嶺	<UNK>	NOUN	n,名詞,固定物,地形	Case=Loc	2	obj	_	_
5	移	移	VERB	v,動詞,行為,移動	_	0	root	_	_
6	雲	雲	NOUN	n,名詞,天象,気象	_	5	obj	_	_

1	松	松	PROPN	n,名詞,人,名	NameType=Giv	0	root	_	_

1	掛	<UNK>	VERB	v,動詞,行為,動作	_	0	root	_	_
2	羅	羅	NOUN	n,名詞,可搬,道具	_	1	obj	_	_
3	而	而	CCONJ	p,助詞,接続,並列	_	4	cc	_	_
4	傾	傾	VERB	v,動詞,行為,動作	_	1	conj	_	_
5	盖	<UNK>	NOUN	n,名詞,可搬,道具	_	4	obj	_	_

1	夕	夕	NOUN	n,名詞,時,*	Case=Tem	2	nmod	_	_
2	岫	<UNK>	NOUN	n,名詞,固定物,地形	Case=Loc	3	nsubj	_	_
3	結	結	VERB	v,動詞,行為,動作	_	0	root	_	_
4	霧	<UNK>	NOUN	n,名詞,可搬,道具	_	3	obj	_	_

1	鳥	鳥	NOUN	n,名詞,主体,動物	_	2	nsubj	_	_
2	封	封	VERB	v,動詞,行為,役割	_	45	csubj	_	_
3	縠	<UNK>	NOUN	n,名詞,可搬,道具	_	2	obj	_	_
4	而	而	CCONJ	p,助詞,接続,並列	_	5	cc	_	_
5	迷	<UNK>	VERB	v,動詞,行為,動作	_	2	conj	_	_
6	林	林	NOUN	n,名詞,固定物,地形	Case=Loc	31	obj	_	_
7	庭	庭	NOUN	n,名詞,固定物,建造物	Case=Loc	40	obl:lmod	_	_
8	舞	舞	VERB	v,動詞,行為,動作	_	2	conj	_	_
9	新	新	VERB	v,動詞,描写,形質	Degree=Pos|VerbForm=Part	10	amod	_	_
10	蝶	<UNK>	NOUN	n,名詞,可搬,道具	_	5	obj	_	_
11	空	空	ADV	v,動詞,描写,形質	Degree=Pos|VerbForm=Conv	40	advmod	_	_
12	歸	歸	VERB	v,動詞,行為,移動	_	2	conj	_	_
13	故	故	NOUN	n,名詞,時,*	Case=Tem	14	nmod	_	_
14	鴈	<UNK>	NOUN	n,名詞,主体,動物	_	40	nsubj	_	_
15	於	於	ADP	v,前置詞,基盤,*	_	16	case	_	_
16	是	是	PRON	n,代名詞,指示,*	PronType=Dem	2	obl	_	_
17	盖	<UNK>	NOUN	n,名詞,不可譲,身体	_	40	nsubj	_	_
18	天	天	NOUN	n,名詞,制度,場	Case=Loc	2	obl	_	_
19	坐	坐	VERB	v,動詞,行為,動作	_	2	conj	_	_
20	地	地	NOUN	n,名詞,固定物,地形	Case=Loc	5	obj	_	_
21	促	<UNK>	VERB	v,動詞,行為,動作	_	2	conj	_	_
22	膝	<UNK>	NOUN	n,名詞,可搬,道具	_	31	obj	_	_
23	飛	飛	VERB	v,動詞,行為,動作	_	2	conj	_	_
24	觴	<UNK>	NOUN	n,名詞,可搬,道具	_	31	obj	_	_
25	忘	忘	VERB	v,動詞,行為,動作	_	2	conj	_	_
26	言	言	NOUN	n,名詞,可搬,伝達	_	31	obj	_	_
27	一	一	NUM	n,数詞,数字,*	_	28	nummod	_	_
28	室	室	NOUN	n,名詞,固定物,建造物	Case=Loc	36	nmod	_	_
29	之	之	SCONJ	p,助詞,接続,属格	_	28	case	_	_
30	裏	<UNK>	NOUN	n,名詞,固定物,関係	Case=Loc	2	conj	_	_
31	開	開	VERB	v,動詞,行為,動作	_	2	conj	_	_
32	衿	<UNK>	NOUN	n,名詞,不可譲,身体	_	31	obj	_	_
33	煙	<UNK>	NOUN	n,名詞,固定物,樹木	_	31	obj	_	_
34	霞	<UNK>	NOUN	n,名詞,固定物,樹木	_	33	flat	_	_
35	之	之	SCONJ	p,助詞,接続,属格	_	28	case	_	_
36	外	外	NOUN	n,名詞,固定物,関係	Case=Loc	2	obj	_	_
37	淡	<UNK>	ADV	v,動詞,描写,形質	Degree=Pos|VerbForm=Conv	2	conj	_	_
38	然	然	PART	p,接尾辞,*,*	_	37	fixed	_	_
39	自	自	PRON	n,代名詞,人称,他	PronType=Prs|Reflex=Yes	40	nsubj	_	_
40	放	放	VERB	v,動詞,行為,動作	_	2	conj	_	_
41	快	<UNK>	VERB	v,動詞,描写,態度	Degree=Pos	40	advmod	_	_
42	然	然	PART	p,接尾辞,*,*	_	37	fixed	_	_
43	自	自	PRON	n,代名詞,人称,他	PronType=Prs|Reflex=Yes	50	obj	_	_
44	足	足	VERB	v,動詞,描写,量	Degree=Pos	2	conj	_	_
45	若	若	VERB	v,動詞,行為,分類	Degree=Equ	0	root	_	_
46	非	非	ADV	v,副詞,否定,体言否定	Polarity=Neg	48	amod	_	_
47	翰	翰	NOUN	n,名詞,可搬,道具	_	48	nmod	_	_
48	苑	苑	NOUN	n,名詞,固定物,建造物	Case=Loc	51	nsubj	_	_
49	何	何	PRON	n,代名詞,疑問,*	PronType=Int	50	obj	_	_
50	以	以	VERB	v,動詞,行為,動作	_	51	advcl	_	_
51	攄	<UNK>	VERB	v,動詞,行為,動作	_	44	parataxis	_	_
52	情	情	NOUN	n,名詞,描写,態度	_	51	obj	_	_

1	詩	詩	NOUN	n,名詞,主体,書物	_	2	nsubj	_	_
2	紀	紀	VERB	v,動詞,行為,動作	_	0	root	_	_
3	落	落	VERB	v,動詞,行為,移動	VerbForm=Part	4	amod	_	_
4	梅	梅	NOUN	n,名詞,固定物,樹木	_	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	篇	篇	NOUN	n,名詞,可搬,伝達	_	2	obj	_	_

1	古	古	NOUN	n,名詞,時,*	Case=Tem	5	nsubj	_	_
2	今	今	NOUN	n,名詞,時,*	Case=Tem	1	conj	_	_
3	夫	夫	PART	p,助詞,句頭,*	_	5	discourse	_	_
4	何	何	ADV	v,副詞,疑問,原因	AdvType=Cau	5	advmod	_	_
5	異	異	VERB	v,動詞,描写,形質	Degree=Pos	0	root	_	_
6	矣	矣	PART	p,助詞,句末,*	_	5	discourse:sp	_	_

1	宜	宜	AUX	v,助動詞,必要,*	Mood=Nec	2	aux	_	_
2	賦	賦	VERB	v,動詞,行為,動作	_	0	root	_	_
3	園	園	NOUN	n,名詞,固定物,建造物	Case=Loc	4	nmod	_	_
4	梅	梅	NOUN	n,名詞,固定物,樹木	_	2	obj	_	_

1	聊	<UNK>	ADV	v,動詞,行為,動作	VerbForm=Conv	2	advmod	_	_
2	成	成	VERB	v,動詞,行為,生産	_	0	root	_	_
3	短	短	VERB	v,動詞,描写,量	Degree=Pos	4	advmod	_	_
4	詠	詠	VERB	v,動詞,行為,伝達	_	2	ccomp	_	_

Is this an improvement?

tiberiu44 avatar Aug 17 '21 09:08 tiberiu44

Yes, yes @tiberiu44, it seems a much better result, except for "松". But I could not download the improved model after I cleaned up ~/.nlpcube/3.0/lzh. Well, has the new model been released?

KoichiYasuoka avatar Aug 18 '21 09:08 KoichiYasuoka

It's not published. The sentence segmentation is still bad. Also, tokenization is worse:

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     93.29 |     92.62 |     92.96 |
Sentences  |     27.12 |      7.65 |     11.94 |
Words      |     93.29 |     92.62 |     92.96 |
UPOS       |     87.02 |     86.40 |     86.71 |     93.28
XPOS       |     84.06 |     83.46 |     83.76 |     90.11
UFeats     |     88.16 |     87.53 |     87.84 |     94.50
AllTags    |     82.22 |     81.64 |     81.93 |     88.14
Lemmas     |     89.80 |     89.15 |     89.47 |     96.26
UAS        |     43.40 |     43.09 |     43.24 |     46.52
LAS        |     39.54 |     39.26 |     39.40 |     42.38
CLAS       |     38.00 |     36.86 |     37.42 |     39.96
MLAS       |     35.55 |     34.49 |     35.01 |     37.39
BLEX       |     36.87 |     35.76 |     36.31 |     38.77

tiberiu44 avatar Aug 18 '21 10:08 tiberiu44

I've released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation for sentence segmentation of Classical Chinese. You can use it with transformers>=4.1:

import torch
from transformers import AutoTokenizer,AutoModelForTokenClassification
tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
model=AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
s="天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠"
p=[model.config.id2label[q] for q in torch.argmax(model(tokenizer.encode(s,return_tensors="pt"))[0],dim=2)[0].tolist()[1:-1]]
print("".join(c+"。" if q=="E" or q=="S" else c for c,q in zip(s,p)))

KoichiYasuoka avatar Aug 18 '21 13:08 KoichiYasuoka

Do we have permission to use your model in NLPCube? Do you need any citation or notice when somebody loads it?

tiberiu44 avatar Aug 18 '21 13:08 tiberiu44

The models are distributed under the Apache License 2.0. You can use them (almost) freely except for trademarks.

KoichiYasuoka avatar Aug 18 '21 13:08 KoichiYasuoka

This sounds good. I will update the runtime code for the tokenizer to be able to use transformer models for tokenization.

tiberiu44 avatar Aug 18 '21 13:08 tiberiu44

One more question: does your model also support tokenization or just sentence segmentation?

tiberiu44 avatar Aug 18 '21 13:08 tiberiu44