lindera

A multilingual morphological analysis library.

Results 25 lindera issues

Make Lindera available from Python.

https://clrd.ninjal.ac.jp/unidic_archive/cwj/3.1.1/unidic-cwj-3.1.1.zip https://clrd.ninjal.ac.jp/unidic/faq.html

Separate IPADIC from the source repository to reduce code size and to make it possible to maintain dictionaries for Lindera. Download the IPADIC archive here: https://github.com/lindera-morphology/mecab-ipadic

Using the 'analyzing example' from the Readme.md with the example configurations in the /resources/ folder will result in a panic for the following sentence:
```
考えうる最も暗くて何も入っていないものを想像し 何億兆回も立方体に詰め込んで下さい そういうところに来たのです
```
with...

As reported in this [issue](https://github.com/quickwit-oss/quickwit/issues/3684), I noticed that the throughput of tokenizers based on `DictionaryKind::CcCedict`, `DictionaryKind::IPADIC`, and `DictionaryKind::KoDic` drops sharply on long text. The `DictionaryKind::CcCedict` can go from 10MB/s...

This feature request is similar to #191, although it focuses on loading compressed custom dictionaries from files instead of the dictionary included in the binary.

Add a mode to tokenize unknown words with Uni-gram.
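
To illustrate what a uni-gram mode for unknown words might mean, here is a minimal sketch in plain Rust. It assumes that "unknown word" means a span the dictionary could not match, and that uni-gram means one token per character. The function name `unigram_tokens` is hypothetical and not part of Lindera's API; how Lindera would actually expose such a mode is not stated in the issue.

```rust
/// Split a span that the dictionary could not match into one token per
/// character (uni-gram). Hypothetical helper for illustration only; this is
/// not a Lindera function.
fn unigram_tokens(unknown: &str) -> Vec<String> {
    unknown
        .chars() // iterate over Unicode scalar values, not bytes
        .filter(|c| !c.is_whitespace()) // assumption: whitespace is not a token
        .map(|c| c.to_string())
        .collect()
}
```

With this sketch, an unmatched span such as `"考えうる"` would be emitted as four single-character tokens. Note that `chars()` splits on Unicode scalar values, so combining sequences would be split apart; a real implementation might prefer grapheme clusters.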