SudachiPy
Python version of Sudachi, a Japanese tokenizer.
While working on spaCy Japanese model support and integrating Sudachi, I ran into an issue where the one-character ellipsis (`…`) was causing errors. If you tokenize this ellipsis you get...
SudachiPy command-line version on a Cygwin terminal. Type: 貴社の記者が汽車で帰社する [Enter]

```
貴社の記者が汽車で帰社する
貴社 名詞,普通名詞,一般,*,*,* 貴社
の 助詞,格助詞,*,*,*,* の
記者 名詞,普通名詞,一般,*,*,* 記者
が 助詞,格助詞,*,*,*,* が
汽車 名詞,普通名詞,一般,*,*,* 汽車
で 助詞,格助詞,*,*,*,* で...
```
Related to #74. The JoinKatakana plugin is one of the most time-consuming plugins, but it is simple (its input and output are clear). OK: cythonize. OK: change the implementation.
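To illustrate why the input and output of such a plugin are easy to state, here is a minimal sketch of the general idea: merge adjacent tokens that consist only of katakana. This is not SudachiPy's actual implementation, which works on lattice nodes and has additional conditions; the function names `is_katakana` and `join_katakana` are hypothetical, and tokens are simplified to plain strings.

```python
from typing import List


def is_katakana(text: str) -> bool:
    """Return True if text is non-empty and every character lies in the
    Unicode Katakana block (U+30A0 to U+30FF)."""
    return bool(text) and all("\u30a0" <= ch <= "\u30ff" for ch in text)


def join_katakana(tokens: List[str]) -> List[str]:
    """Merge each run of consecutive katakana-only tokens into one token.

    Hypothetical simplification: real plugins operate on morpheme nodes,
    not bare strings.
    """
    joined: List[str] = []
    prev_katakana = False
    for token in tokens:
        cur_katakana = is_katakana(token)
        if cur_katakana and prev_katakana:
            # Extend the katakana run that ended the output so far.
            joined[-1] += token
        else:
            joined.append(token)
        prev_katakana = cur_katakana
    return joined
```

A single pass over a list like this is the kind of tight loop that is a plausible candidate for either Cythonizing or an algorithmic rewrite, as the issue suggests.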
Related to #74. The JoinNumeric plugin is one of the most time-consuming plugins, but it is simple (its input and output are clear). OK: cythonize. OK: change the implementation.
https://github.com/explosion/spaCy/issues/3756#issuecomment-516020381 https://github.com/WorksApplications/Sudachi/issues/110#issue-475708033
The original Java version has a plugin structure for adding and removing extra processing steps. How can we do something similar with SudachiPy?
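One possible direction, offered only as a sketch and not as SudachiPy's or the Java version's actual design, is a small registry of named callables that are applied in order and can be added or removed at runtime. Everything here is hypothetical: the `PluginPipeline` class, the `Rewriter` type, and the two example plugins, with tokens again simplified to plain strings.

```python
from typing import Callable, Dict, List

# Hypothetical plugin type: takes a token list, returns a rewritten list.
Rewriter = Callable[[List[str]], List[str]]


class PluginPipeline:
    """Ordered registry of named token-rewriting plugins (illustrative only)."""

    def __init__(self) -> None:
        # Dicts preserve insertion order, so plugins run in the order added.
        self._plugins: Dict[str, Rewriter] = {}

    def add(self, name: str, plugin: Rewriter) -> None:
        self._plugins[name] = plugin

    def remove(self, name: str) -> None:
        # Removing an unknown name is a no-op.
        self._plugins.pop(name, None)

    def run(self, tokens: List[str]) -> List[str]:
        for plugin in self._plugins.values():
            tokens = plugin(tokens)
        return tokens


def drop_whitespace(tokens: List[str]) -> List[str]:
    """Example plugin: discard tokens that are empty or only whitespace."""
    return [t for t in tokens if t.strip()]


def join_digits(tokens: List[str]) -> List[str]:
    """Example plugin: merge consecutive digit-only tokens."""
    out: List[str] = []
    for t in tokens:
        if out and t.isdigit() and out[-1].isdigit():
            out[-1] += t
        else:
            out.append(t)
    return out
```

Usage would look like `pipeline = PluginPipeline()`, then `pipeline.add("join_digits", join_digits)` followed by `pipeline.run(tokens)`; calling `pipeline.remove("join_digits")` disables that step without touching the others.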