LIME tokenizer for SentencePiece (or other tokenizers)
Currently, it seems to just use `str.split`, so it doesn't work with non-space-segmented languages like Japanese.
https://github.com/PAIR-code/lit/blob/3eb824b01e0f72a5486124b16056bf912465debc/lit_nlp/components/citrus/lime.py#L85
I tried to use it with a SentencePiece-based model (Japanese ALBERT), but it handled the input sentence as a single word.
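To illustrate, `str.split` only segments on whitespace, so a Japanese sentence comes back as a single token and LIME has only one feature to perturb:

```python
# str.split only breaks on whitespace, so a Japanese sentence stays in one piece.
print("This is a pen".split())    # ['This', 'is', 'a', 'pen']
print("これはペンです。".split())    # ['これはペンです。']  <- one "word" for LIME to ablate
```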
I think it would be good to use `model._model.tokenizer.tokenize` instead of `str.split`.
Unfortunately, the change didn't work well.
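For reference, this is roughly what I tried (a sketch only; the exact keyword arguments of citrus' `explain()` are from memory, so treat the signature as an assumption):

```python
from lit_nlp.components.citrus import lime

# Sketch of the attempted change: pass the model's SentencePiece tokenizer to
# LIME instead of letting it fall back to str.split. `sentence`, `predict_fn`
# and `model` are placeholders for the input text, the prediction function,
# and the wrapped LIT model.
explanation = lime.explain(
    sentence,
    predict_fn,
    class_to_explain=1,
    tokenizer=model._model.tokenizer.tokenize,  # instead of the default str.split
)
```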
For LIME (and other ablation-style techniques), we want to tokenize on full words rather than word pieces, which is what the model tokenizer might produce. Is there a simple way to do word-based tokenization for non-space-segmented languages?
I can't speak for other non-space-segmented languages (Chinese, Thai, ...), but in Japanese, at least, "word" is a somewhat ambiguous concept. To segment text into words, you need a morphological analyser such as MeCab together with its dictionaries.
The Japanese BERT tokenizer in Transformers uses MeCab and SentencePiece, but the ALBERT tokenizer does not.
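As a sketch of what word-level segmentation could look like (assuming the fugashi MeCab wrapper and a dictionary such as unidic-lite are installed), a function like the one below could be passed to LIME in place of `str.split`:

```python
import fugashi  # thin MeCab wrapper; needs a dictionary package such as unidic-lite

_tagger = fugashi.Tagger()

def mecab_word_tokenize(sentence):
    """Segments Japanese text into surface-form words with MeCab."""
    return [word.surface for word in _tagger(sentence)]

# The exact segmentation depends on the dictionary, but a sentence like
# "これはペンです。" comes back as several word-level tokens instead of one,
# which gives LIME meaningful units to ablate.
print(mecab_word_tokenize("これはペンです。"))
```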