Sudachi

A Japanese Tokenizer for Business

Results 29 Sudachi issues

We plan to introduce dictionary build warnings, which will not abort the building of the dictionary, but will report that something was not good. Warning-producing checks will be optional, but...

enhancement

Is there any utility to generate the "user dictionary source file" from a raw file that contains sentences together with their tokens and a POS mapping for each token? I mean...

If the dictionary file is kept inside a resource JAR, `MMap.class` cannot read it. Could this be modified to support `BinaryDictionary.class.getClassLoader().getResourceAsStream("system_core.dic")`? Then from...
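A minimal sketch of what this request amounts to: reading a dictionary that is packaged on the classpath into a `ByteBuffer` instead of memory-mapping a file. The class `ResourceDictionaryLoader` and its methods are hypothetical names for illustration, not part of Sudachi, and the sketch does not cover how Sudachi would consume the buffer (byte order, header parsing). It assumes Java 9 or later for `InputStream.readAllBytes()`.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Sudachi.
final class ResourceDictionaryLoader {

    /** Reads the whole stream into a heap-backed buffer. */
    static ByteBuffer readFully(InputStream in) throws IOException {
        byte[] bytes = in.readAllBytes(); // Java 9+
        return ByteBuffer.wrap(bytes);
    }

    /** Loads a classpath resource such as "system_core.dic" into memory. */
    static ByteBuffer loadFromClasspath(String resourceName) throws IOException {
        ClassLoader loader = ResourceDictionaryLoader.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                // getResourceAsStream returns null when the resource is missing
                throw new FileNotFoundException("resource not found: " + resourceName);
            }
            return readFully(in);
        }
    }
}
```

Unlike a memory-mapped file, this copies the whole dictionary onto the heap, so memory use grows with dictionary size.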

* It is not useful
* It is not really used inside Sudachi

This functionality will be removed in 1.0 as nobody seems to be using it. Please comment here if you are actually using it and do not want to have it...

Why: Use the 0xf pattern as a marker when a word is OOV, to reduce the size of `LatticeNodeImpl`.
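The issue does not spell out the encoding, so the following is only one possible reading: rather than storing a separate boolean flag on each lattice node, a reserved value `0xf` in the top 4 bits of a 32-bit ID marks the node as OOV. The 4/28 bit split and all names (`PackedWordId`, `pack`, `isOov`) are assumptions for illustration, not Sudachi's actual code.

```java
// Hypothetical sketch: fold an "is OOV" flag into the upper bits of an int ID.
final class PackedWordId {
    static final int WORD_BITS = 28;                    // assumed split: 4 + 28 bits
    static final int WORD_MASK = (1 << WORD_BITS) - 1;  // low 28 bits
    static final int OOV_MARKER = 0xf;                  // reserved upper-bit pattern

    /** Packs a 4-bit dictionary ID and a 28-bit word ID into one int. */
    static int pack(int dicId, int wordId) {
        return (dicId << WORD_BITS) | (wordId & WORD_MASK);
    }

    /** Extracts the upper 4 bits (unsigned shift, so 0xf does not sign-extend). */
    static int dicId(int packed) {
        return packed >>> WORD_BITS;
    }

    /** Extracts the lower 28 bits. */
    static int wordId(int packed) {
        return packed & WORD_MASK;
    }

    /** A node is OOV when its upper bits carry the reserved marker. */
    static boolean isOov(int packed) {
        return dicId(packed) == OOV_MARKER;
    }
}
```

With this layout the OOV check costs one shift and one comparison, and the node needs no extra field for the flag.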

Related: https://github.com/explosion/spaCy/issues/3756#issuecomment-516020381
```
echo "東京都 へ 行く" | java -jar target/sudachi-0.3.0.jar
東京都 名詞,固有名詞,地名,一般,*,* 東京都
空白,*,*,*,*,*
空白,*,*,*,*,*
へ 助詞,格助詞,*,*,*,* へ
空白,*,*,*,*,*
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く
```
Is this the expected result? Multiple...

```
ご期待くださいーー!!
ご 接頭辞,*,*,*,*,* 御
期待 名詞,普通名詞,サ変可能,*,*,* 期待
くださ 動詞,一般,*,*,五段-サ行,未然形-一般 下す
いーー 感動詞,フィラー,*,*,*,* いー
! 補助記号,句点,*,*,*,* !
! 補助記号,句点,*,*,*,* !
```
Expected result:
```
ご期待くださいーー!!
ご 接頭辞,*,*,*,*,* 御
期待 名詞,普通名詞,サ変可能,*,*,*...
```

The first column of a source of user dictionary is a headword for TRIE. Because input texts are normalized by `DefaultInputTextPlugin`, the headwords must be normalized in the same way....
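As a rough illustration of normalizing a headword before writing it to the user dictionary source, the sketch below lowercases the text and applies Unicode NFKC normalization. This is an approximation under stated assumptions: the real `DefaultInputTextPlugin` may apply additional or different rules, so its output should be checked against the plugin itself. The class name `HeadwordNormalizer` is hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper; only approximates the plugin's normalization.
final class HeadwordNormalizer {

    /** Lowercases the headword, then applies NFKC compatibility normalization. */
    static String normalize(String headword) {
        String lower = headword.toLowerCase(Locale.ROOT);
        return Normalizer.normalize(lower, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width Latin letters become half-width lowercase ASCII.
        System.out.println(normalize("ＳＵＤＡＣＨＩ"));
    }
}
```

A headword written as `ＳＵＤＡＣＨＩ` in the source file would never match input text that has already been normalized to `sudachi`, which is why the headword column needs the same treatment.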

Known words and OOV words are segmented differently even though their word structures are the same.
> 全国的 名詞,普通名詞,形状詞可能,*,*,* 全国的
> 間接 名詞,普通名詞,一般,*,*,* 間接
> 的 接尾辞,形状詞的,*,*,*,* 的

Adjust them by...