Koichi Yasuoka

19 comments by Koichi Yasuoka

Yes, OK. Classical Chinese does not have any punctuation or spaces between words or sentences. Therefore, in my humble opinion, tokenization is a hard task without POS-tagging, and sentencization is...

Umm... I only know the [Straka & Straková (2017)](http://ufal.mff.cuni.cz/~straka/papers/2017-conll_udpipe.pdf) approach using dynamic programming (see Section 4.3), but it requires tentative parse trees...
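
For illustration, here is a minimal dynamic-programming sketch over candidate sentence boundaries. This is not UDPipe's exact procedure: `score` is a hypothetical stand-in for the scoring that Straka & Straková derive from tentative parse trees, and the toy scorer at the bottom is purely illustrative.

```
def best_segmentation(chars, score, max_len=30):
    # best[i]: best total score for a segmentation of chars[:i]
    # back[i]: start index of the last sentence in that segmentation
    n = len(chars)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            s = best[j] + score(chars[j:i])
            if s > best[i]:
                best[i], back[i] = s, j
    # Recover the sentence boundaries by walking the backpointers.
    sentences, i = [], n
    while i > 0:
        sentences.append("".join(chars[back[i]:i]))
        i = back[i]
    return sentences[::-1]

# Toy scorer: favor sentences of about four characters (illustrative only).
print(best_segmentation("敬事而信節用而愛人使民以時",
                        lambda s: -abs(len(s) - 4)))
```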

Umm... For Japanese tokenisation (word splitting) and POS-tagging, we often apply Conditional Random Fields, as in [Kudo et al. (2004)](https://www.aclweb.org/anthology/W04-3230/). For Classical Chinese, we also use a CRF in our [UD-Kanbun](https://github.com/KoichiYasuoka/UD-Kanbun). For...
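
As a rough sketch of that CRF approach, character tagging with `sklearn-crfsuite` (`pip install sklearn-crfsuite`); this is not the exact feature template of Kudo et al. or UD-Kanbun, and the one-sentence training data below is a toy, labeled for illustration only.

```
import sklearn_crfsuite

def char_features(sent, i):
    # Character unigram/bigram window features around position i.
    feats = {"c0": sent[i]}
    if i > 0:
        feats["c-1"] = sent[i - 1]
        feats["c-1:c0"] = sent[i - 1] + sent[i]
    if i < len(sent) - 1:
        feats["c+1"] = sent[i + 1]
        feats["c0:c+1"] = sent[i] + sent[i + 1]
    return feats

# Toy data: B = word-initial character, I = word-internal character.
# Real training would use gold segmentation from a treebank.
train = [("道千乘之國", ["B", "B", "I", "B", "B"])]
X = [[char_features(s, i) for i in range(len(s))] for s, _ in train]
y = [labels for _, labels in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict([[char_features("千乘之國", i) for i in range(4)]]))
```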

@tiberiu44 - Thank you for using our UD_Classical_Chinese-Kyoto for your NLP-Cube. We've just finished adding 19 more volumes from "禮記" (the Book of Rites) into https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto/tree/dev for the v2.6 release of the UD Treebanks...

Thank you @tiberiu44 for releasing NLP-Cube 3.0. But, well, `pytorch-lightning==1.1.7` is too old for the recent `torchtext==0.10.0`, so I use `pytorch-lightning==1.2.10` instead:
```
>>> from cube.api import Cube
>>> nlp=Cube()
>>>...
```
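
For anyone hitting the same conflict, the pinned install described above would look roughly like this (a sketch, assuming this version pair still resolves together on PyPI):

```
pip install nlpcube 'pytorch-lightning==1.2.10' 'torchtext==0.10.0'
```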

Umm... the first eleven characters seem untokenized:
```
>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("子曰道千乘之國敬事而信節用而愛人使民以時")
>>> print(doc)
1	子曰道千乘之國敬事而信	子春于	PROPN	n,名詞,人,名	NameType=Giv	2	nsubj	_	_...
```

Thank you @tiberiu44, and I will wait for the new tokenizer. Ah, well, for sentence segmentation of Classical Chinese, I released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-char and https://github.com/KoichiYasuoka/SuPar-Kanbun using the segmentation algorithm of...
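
As a rough illustration of that direction, a minimal token-classification sketch with `transformers`; the fine-tuned checkpoint name below and the E/S label handling are my assumptions (a hypothetical segmentation head fine-tuned from the linked roberta-classical-chinese-large-char model), not something stated in the comment.

```
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical checkpoint: a token-classification head fine-tuned for
# sentence segmentation on top of roberta-classical-chinese-large-char.
ckpt = "KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(ckpt)

text = "子曰道千乘之國敬事而信節用而愛人使民以時"
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
labels = [model.config.id2label[i] for i in logits[0].argmax(-1).tolist()]
# Assuming one token per character plus special tokens at both ends,
# put a break after each character predicted as sentence-final (E).
print("".join(c + ("。" if l.startswith("E") else "")
              for c, l in zip(text, labels[1:-1])))
```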

Thank you @tiberiu44 for releasing `nlpcube` 0.3.0.7. I tried the new model of Classical Chinese with `pytorch-lightning==1.2.10` and `torchtext==0.10.0`:
```
>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>>...
```

20.86% is much worse than the result (80%) of [一种基于循环神经网络的古文断句方法](http://xbna.pku.edu.cn/CN/10.13209/j.0479-8023.2017.032) (a recurrent-neural-network-based method for sentence segmentation of ancient Chinese texts). OK, here I try it myself with `transformers` on Google Colab:
```
!pip install 'transformers>=4.7.0' datasets seqeval
!test -d UD_Classical_Chinese-Kyoto ||...
```
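
The notebook above is cut off, so here is a minimal sketch of how such a run could cast sentence segmentation as character-level token classification and fine-tune with `transformers`. The two-label E/S scheme, the helper names, and the toy training example are my own assumptions, not the comment's actual code; it also assumes the char-level tokenizer emits exactly one token per character.

```
import torch
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

ckpt = "KoichiYasuoka/roberta-classical-chinese-large-char"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    ckpt, num_labels=2, id2label={0: "S", 1: "E"}, label2id={"S": 0, "E": 1})

def make_example(sentences):
    # Concatenate gold sentences (e.g. the `# text` lines of the Kyoto
    # treebank) and tag each character: E if sentence-final, else S.
    text = "".join(sentences)
    tags = [1 if i == len(s) - 1 else 0
            for s in sentences for i in range(len(s))]
    enc = tokenizer(text, truncation=True, max_length=512)
    n = len(enc["input_ids"]) - 2   # characters kept after truncation
    # -100 masks only the two special tokens from the loss.
    enc["labels"] = [-100] + tags[:n] + [-100]
    return enc

class SegDataset(torch.utils.data.Dataset):
    def __init__(self, examples): self.examples = examples
    def __len__(self): return len(self.examples)
    def __getitem__(self, i):
        return {k: torch.tensor(v) for k, v in self.examples[i].items()}

# Toy one-example dataset; real training would iterate the treebank.
train = SegDataset([make_example(["道千乘之國", "敬事而信", "節用而愛人", "使民以時"])])
args = TrainingArguments("seg-model", per_device_train_batch_size=8,
                         num_train_epochs=3, learning_rate=3e-5)
Trainer(model=model, args=args, train_dataset=train).train()
```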

Thank you @tiberiu44 for releasing nlpcube 0.3.1.0. I cleaned up my `~/.nlpcube/3.0/lzh`:
```
>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠")
>>> print("".join(s.text.replace(" ","")+"。" for s in...
```