SudachiTra

Can I use a user dictionary?

Open mumumu09chi opened this issue 2 years ago • 2 comments

I want to use a user dictionary. How can I do that?

mumumu09chi, May 12 '23 04:05

BertSudachipyTokenizer takes a sudachipy_kwargs argument, which is used to initialize the Sudachi tokenizer. https://github.com/WorksApplications/SudachiTra/blob/3f4a6c3a976a2b047a7714192928e7ac229fa699/sudachitra/tokenization_bert_sudachipy.py#L173 https://github.com/WorksApplications/SudachiTra/blob/3f4a6c3a976a2b047a7714192928e7ac229fa699/sudachitra/sudachipy_word_tokenizer.py#L47C1-L71C12

Prepare a config file (see the user dictionary section) and provide it via config_path, for example sudachipy_kwargs={"config_path": "path/to/your/config"}.
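A minimal sketch of that step, using only the standard library to write the config file and build the kwargs. The helper name `write_sudachi_config` and the file names are hypothetical, and the config is assumed to need only the `userDict` key; depending on your SudachiPy version you may need to start from the full default sudachi.json instead. The user dictionary itself must already be compiled.

```python
import json
import tempfile
from pathlib import Path


def write_sudachi_config(user_dict_paths, config_path):
    """Write a Sudachi config file that lists compiled user dictionaries.

    Hypothetical helper: it only sets the "userDict" key and leaves
    every other setting to Sudachi's defaults (an assumption).
    """
    config = {"userDict": [str(p) for p in user_dict_paths]}
    Path(config_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return str(config_path)


# Placeholder locations; point these at your real files.
workdir = Path(tempfile.mkdtemp())
config_path = write_sudachi_config([workdir / "user.dic"], workdir / "sudachi.json")

# These kwargs are what gets forwarded to the Sudachi tokenizer.
sudachipy_kwargs = {"config_path": config_path}

# Assumed usage (requires sudachitra and a model directory, not run here):
# from sudachitra import BertSudachipyTokenizer
# tokenizer = BertSudachipyTokenizer.from_pretrained(
#     "path/to/model", sudachipy_kwargs=sudachipy_kwargs
# )
```

Keeping the config in its own file means the same settings can also be reused when calling SudachiPy directly.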

Note that the final output of the SudachiTra tokenizer depends on its vocabulary, so user-defined words may still be split according to that vocabulary.

mh-northlander, Feb 06 '24 02:02

With the latest version, it is possible to pass a sudachipy.config.Config object (or its JSON representation) as the config_path parameter. This change was made specifically for using Sudachi inside tokenizers while keeping backward compatibility.
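As a rough sketch of the JSON variant, the config can be built in memory and passed as a string, with no config file on disk. The helper name `inline_sudachi_config` is hypothetical, and the `userDict` key is assumed to be accepted in the inline JSON just as it is in a config file. The commented-out Config call assumes a `user` field on sudachipy.config.Config; check the field names in your installed version.

```python
import json


def inline_sudachi_config(user_dict_paths):
    """Return a JSON string describing a Sudachi config with user dictionaries.

    Hypothetical helper: only the "userDict" key is set.
    """
    return json.dumps({"userDict": [str(p) for p in user_dict_paths]})


# The JSON string takes the place of a file path in config_path.
sudachipy_kwargs = {"config_path": inline_sudachi_config(["path/to/user.dic"])}

# Assumed equivalent with a Config object (requires a recent sudachipy, not run here):
# from sudachipy.config import Config
# sudachipy_kwargs = {"config_path": Config(user=["path/to/user.dic"])}
```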

eiennohito, Feb 07 '24 09:02