lindera issues

Make Lindera available from Python.

Migrate UniDic3

https://clrd.ninjal.ac.jp/unidic_archive/cwj/3.1.1/unidic-cwj-3.1.1.zip

https://clrd.ninjal.ac.jp/unidic/faq.html

Separate IPADIC from the source repository

Separate IPADIC from the source repository to reduce code size and to make it possible to maintain dictionaries for Lindera. The IPADIC archive can be downloaded from: https://github.com/lindera-morphology/mecab-ipadic

mosuka

Invalid digit number

3

closes: #326

higumachan

Integer overflow in lindera-filter

3

Using the 'analyzing example' from the Readme.md with the example configurations in the /resources/ folder results in a panic for the following sentence:

```
考えうる最も暗くて何も入っていないものを想像し何億兆回も立方体に詰め込んで下さいそういうところに来たのです
```

with...

JojiiOfficial
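The report above is truncated, so the actual cause of the panic is not stated here. As a purely illustrative sketch, assume the overflow comes from multiplying consecutive kanji magnitude units (the sentence contains 億 followed by 兆, and 10^8 × 10^12 does not fit in a `u64`). The helper names `unit_value` and `multiply_units` are hypothetical and are not part of lindera-filter. The sketch shows how `checked_mul` turns such an overflow into a recoverable `None`, where plain `*` would panic in a debug build.

```rust
/// Map a kanji magnitude unit to its numeric value.
/// Only three units are covered; this table is illustrative.
fn unit_value(ch: char) -> Option<u64> {
    match ch {
        '万' => Some(10_000),
        '億' => Some(100_000_000),
        '兆' => Some(1_000_000_000_000),
        _ => None,
    }
}

/// Multiply consecutive magnitude units, returning `None` instead of
/// overflowing when the product does not fit in a `u64`.
/// Also returns `None` if a character is not a known unit.
fn multiply_units(text: &str) -> Option<u64> {
    let mut product: u64 = 1;
    for ch in text.chars() {
        let value = unit_value(ch)?;
        // checked_mul yields None on overflow rather than panicking.
        product = product.checked_mul(value)?;
    }
    Some(product)
}

fn main() {
    // 10^4 * 10^8 = 10^12, which fits in a u64.
    println!("{:?}", multiply_units("万億"));
    // 10^8 * 10^12 = 10^20, which exceeds u64::MAX.
    println!("{:?}", multiply_units("億兆"));
}
```

Whether a filter should skip, clamp, or report such a value is a design decision this sketch leaves open.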

Tokenizer throughput decreases a lot on long text.

3

As reported in this [issue](https://github.com/quickwit-oss/quickwit/issues/3684), I noticed that the throughput of tokenizers based on `DictionaryKind::CcCedict`, `DictionaryKind::IPADIC`, and `DictionaryKind::KoDic` decreases significantly on long text. The `DictionaryKind::CcCedict` can go from 10MB/s...

fmassot
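One way to reproduce a report like the one above is to time the same tokenizer on inputs of increasing length and compare MB/s. The following is a minimal sketch of such a harness. The `tokenize` closure is a stand-in that only counts whitespace-separated words; it is an assumption for illustration and would be replaced by a call into the tokenizer under test. The helper names `throughput_mb_per_s` and `time_tokenize` are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Convert a byte count and elapsed time into MB/s (1 MB = 1,000,000 bytes).
fn throughput_mb_per_s(bytes: usize, elapsed: Duration) -> f64 {
    bytes as f64 / 1_000_000.0 / elapsed.as_secs_f64()
}

/// Time a single call of `tokenize` on `text` and return the elapsed time.
/// `tokenize` is a placeholder for whatever tokenizer is being measured.
fn time_tokenize<F: FnMut(&str) -> usize>(text: &str, mut tokenize: F) -> Duration {
    let start = Instant::now();
    let count = tokenize(text);
    let elapsed = start.elapsed();
    // Keep the result observable so the call is not optimized away.
    std::hint::black_box(count);
    elapsed
}

fn main() {
    // Stand-in tokenizer (assumption): counts whitespace-separated words.
    let tokenize = |text: &str| text.split_whitespace().count();

    // Grow the input by a factor of ten each round and report throughput.
    for repeats in [1_000usize, 10_000, 100_000] {
        let text = "lindera tokenizes text ".repeat(repeats);
        let elapsed = time_tokenize(&text, tokenize);
        println!(
            "{} bytes: {:.1} MB/s",
            text.len(),
            throughput_mb_per_s(text.len(), elapsed)
        );
    }
}
```

A single timed call is noisy; repeating each measurement and taking the median would give steadier numbers.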

Support compressed dictionaries

This feature request is similar to #191, but it focuses on loading compressed custom dictionaries from files rather than the dictionary included in the binary.

JojiiOfficial

Add Extended mode

Add a mode that tokenizes unknown words into uni-grams.

mosuka
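To make the Extended mode request concrete, here is a minimal sketch of uni-gram splitting for unknown words. It is not Lindera's implementation: the `Token` struct and the `split_unknown_into_unigrams` function are hypothetical, and the sketch assumes the tokenizer has already flagged which tokens were not found in the dictionary.

```rust
/// A token as produced by a hypothetical tokenizer: its surface text and
/// whether the dictionary recognized it. This type is illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    surface: String,
    is_unknown: bool,
}

/// Split every unknown token into one token per Unicode scalar value
/// (uni-grams); known tokens pass through unchanged.
/// Note: `chars()` splits on scalar values, not grapheme clusters.
fn split_unknown_into_unigrams(tokens: Vec<Token>) -> Vec<Token> {
    let mut out = Vec::with_capacity(tokens.len());
    for token in tokens {
        if token.is_unknown {
            for ch in token.surface.chars() {
                out.push(Token {
                    surface: ch.to_string(),
                    is_unknown: true,
                });
            }
        } else {
            out.push(token);
        }
    }
    out
}

fn main() {
    let tokens = vec![
        Token { surface: "東京".to_string(), is_unknown: false },
        Token { surface: "ABC".to_string(), is_unknown: true },
    ];
    for token in split_unknown_into_unigrams(tokens) {
        println!("{} (unknown: {})", token.surface, token.is_unknown);
    }
}
```

In this sketch the known token 東京 stays whole, while the unknown token ABC becomes three single-character tokens.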


Support ko-dic user dictionary

Support UniDic user dictionary

Add Python bindings

Migrate UniDic3

Separate IPADIC from the source repository

Invalid digit number

Integer overflow in lindera-filter

Tokenizer throughput decreases a lot on long text.

Support compressed dictionaries

Add Extended mode
