
LIME tokenizer for SentencePiece (or other tokenizer)

Open knok opened this issue 4 years ago • 3 comments

Currently, it seems to just use str.split, so it doesn't work with non-space-segmented languages like Japanese.

https://github.com/PAIR-code/lit/blob/3eb824b01e0f72a5486124b16056bf912465debc/lit_nlp/components/citrus/lime.py#L85

I tried to use it with a SentencePiece-based model (Japanese ALBERT), but it handles the input sentence as a single word. I think it would be good to use model._model.tokenizer.tokenize instead of str.split.
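A minimal illustration of the behavior described above (the example sentence is hypothetical, not from the linked code):

```python
# str.split breaks on whitespace, so an unsegmented Japanese sentence
# comes back as one "word" -- LIME then has nothing to ablate.
sentence = "これはペンです。"  # "This is a pen."
print(sentence.split())  # a single-element list

# A space-segmented English sentence splits into words as expected.
print("This is a pen.".split())
```

With only one token, LIME's perturbation step has no smaller units to remove, so no meaningful attributions can be computed.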

knok avatar Jun 01 '21 08:06 knok

Unfortunately, the change didn't work well.

knok avatar Jun 01 '21 23:06 knok

For LIME (and other ablation-style techniques), we want to tokenize on full words and not word pieces, which the model tokenizer might do. Is there a simple way to do word-based tokenization for non-space segmented languages?
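One reason the model tokenizer is not a drop-in replacement: SentencePiece emits subword pieces, marking word starts with "▁" (U+2581). For space-segmented text those pieces can be merged back into full words by grouping on that marker, but for Japanese there are no spaces to recover. A sketch with a hypothetical piece list:

```python
def merge_pieces(pieces):
    """Group SentencePiece pieces into words using the '▁' word-start marker."""
    words = []
    for p in pieces:
        if p.startswith("▁") or not words:
            words.append(p.lstrip("▁"))
        else:
            words[-1] += p  # continuation piece: append to current word
    return words

# Hypothetical SentencePiece output for an English sentence.
pieces = ["▁This", "▁is", "▁a", "▁pen", "."]
print(merge_pieces(pieces))  # ['This', 'is', 'a', 'pen.']
```

For Japanese input, every piece after the first lacks the marker, so this grouping would collapse the whole sentence back into one word, which is why a morphological analyzer is needed instead.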

jameswex avatar Jun 02 '21 12:06 jameswex

I don't know much about other non-space-segmented languages (maybe Chinese, Thai, ...), but at least in Japanese, "word" is a somewhat ambiguous concept. To segment text into words, you need to use a morphological analyzer like MeCab together with its dictionaries.

The Japanese BERT tokenizer in Transformers uses MeCab and SentencePiece, but the ALBERT one does not.

knok avatar Jun 03 '21 00:06 knok