Excessive Duplicated Sentences in LIME Text Output
I'm using LIME's text explainer (LimeTextExplainer) to explain the results of sentiment analysis. While testing various sentences, I noticed that an excessive number of the perturbed sentences passed to my prediction function are duplicates. My setup is shown below (the model is an ELECTRA model, not fine-tuned).
from lime.lime_text import LimeTextExplainer

# `model` and `DEVICE` are defined earlier (not shown here).
model.eval()  # Set the model to evaluation mode
model.to(DEVICE)

class_names = ['pos', 'neg']
explainer = LimeTextExplainer(
    class_names=class_names,
    bow=False,  # If True, masks all instances of the same word in a sentence simultaneously
    mask_string='_',  # Default is UNKWORDZ, let's change it to a special token present in the model
    random_state=124,  # Ensures reproducibility of the explanation results
)
from collections import Counter

import pandas as pd
import torch
from transformers import AutoTokenizer

# `tokenizer` must already exist when this function is defined; its creation is not shown here.
def pred_proba_for_lime(sentences, model=model, tokenizer=tokenizer, device=DEVICE):
    # Count how often each perturbed sentence appears and print the most frequent first.
    counter = Counter(sentences)
    print(pd.DataFrame(counter.most_common(), columns=['sentence', 'freq']))

    # Convert the sentences into model inputs.
    inputs = tokenizer(list(sentences), padding=True, truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    with torch.no_grad():  # Disable gradient computation
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    # Apply softmax to turn the logits into class probabilities.
    probs = torch.nn.functional.softmax(logits, dim=-1).cpu().numpy()
    return probs
I set bow=False so that identical words are treated as different features depending on their position in the sentence, and mask_string = '_' so that the masking can be checked in the examples that follow.
For instance, here is a short sentence example:
input_doc = '''The whole new iPhone 15 Pro is awesome'''
explainer.explain_instance(input_doc, pred_proba_for_lime).show_in_notebook(text=True)
In this case, there are 599 duplicates among the masked sentences generated. Even more concerning, the most frequently duplicated sentence is one in which no original tokens remain at all.
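A rough way to sanity-check duplicate counts like this is to compare how many distinct masks an 8-word sentence allows with how many perturbations are drawn. The sketch below is only a simulation under stated assumptions, not LIME's actual code: it assumes each perturbation first draws the number of words to mask uniformly from 1 to n and then picks the positions at random, and that 5000 samples are requested. `simulate_mask_counts` is a hypothetical helper introduced here for illustration.

```python
import random
from collections import Counter

def simulate_mask_counts(n_tokens, num_samples=5000, seed=124):
    """Count how often each mask appears under the assumed sampling scheme."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_samples):
        k = rng.randint(1, n_tokens)  # assumed: mask size drawn uniformly from 1..n_tokens
        masked = frozenset(rng.sample(range(n_tokens), k))  # assumed: random positions
        counts[masked] += 1
    return counts

n_tokens = 8  # words in "The whole new iPhone 15 Pro is awesome"
num_samples = 5000  # assumed number of perturbations
counts = simulate_mask_counts(n_tokens, num_samples)

all_masked = frozenset(range(n_tokens))
print("distinct masks possible:", 2 ** n_tokens - 1)  # non-empty subsets of positions
print("distinct masks drawn:   ", len(counts))
print("all-words-masked count: ", counts[all_masked])
print("expected for that mask: ", num_samples / n_tokens)
```

Under these assumptions an 8-word sentence has at most 255 distinct masks, so 5000 draws must contain many repeats, and the fully masked sentence is the single most likely draw because only one mask removes all 8 positions. Whether the installed LIME version samples exactly this way is worth confirming against its source.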
Additionally, here is an example with a longer sentence:
input_doc = '''The whole new iPhone 15 Pro is awesome, truly setting a new benchmark in the world of smartphones.
With its cutting-edge technology and innovative design, it stands out as a masterpiece of modern engineering.
From its sleek, robust exterior to the advanced internal components, every aspect of the iPhone 15 Pro is designed to impress. '''
explainer.explain_instance(input_doc, pred_proba_for_lime).show_in_notebook(text=True)
While the frequency of duplication has decreased with longer sentences, there are still a significant number of sentences that are duplicated. Notably, the most duplicated cases include sentences that, aside from newline (\n) and backtick (```) characters, contain no tokens.
LIME is expected to mask n tokens randomly, but the outcomes don't seem random. Is this normal or a malfunction? If it's a malfunction, is it okay to remove duplicates manually for a unique sentence set? This might significantly cut down on LIME's execution time if it's unnecessary to rerun duplicate sentences.
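If the goal is only to avoid re-running the model on repeated sentences, one option is to deduplicate inside the prediction function and expand the results back, so the caller still receives one probability row per input sentence. This is a minimal sketch; `dedupe_predict` is a hypothetical wrapper written for this post, not a LIME feature, and it assumes the wrapped function accepts a list of strings and returns an array with one row per string.

```python
import numpy as np

def dedupe_predict(predict_fn):
    """Wrap predict_fn so each distinct sentence is scored only once."""
    def wrapped(sentences):
        first_index = {}  # sentence -> row in the deduplicated batch
        inverse = []      # for each input sentence, its row in that batch
        for s in sentences:
            if s not in first_index:
                first_index[s] = len(first_index)
            inverse.append(first_index[s])
        unique = list(first_index)  # distinct sentences, in first-seen order
        probs = np.asarray(predict_fn(unique))
        return probs[inverse]  # expand back to one row per original input
    return wrapped

# Usage with the objects defined earlier in this post:
# explainer.explain_instance(input_doc, dedupe_predict(pred_proba_for_lime))
```

The wrapper leaves the set of perturbed samples unchanged and only reduces the number of forward passes, so the explanation should be fitted on the same rows as before.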