lit Excessive Duplicated Sentences in LIME Text Output

Open smbslt3 opened this issue 11 months ago • 3 comments

I'm using LIME text to explain the results of sentiment analysis. When testing various sentences, I noticed that an excessive number of duplicated sentences are passed as inputs to the prediction function. My setup is below (the model is an ELECTRA model that has not been fine-tuned):

from lime.lime_text import LimeTextExplainer

model.eval()  # Set the model to evaluation mode
model.to(DEVICE)

class_names = ['pos', 'neg']
explainer = LimeTextExplainer(class_names=class_names,
                              bow=False,          # If True, masks all instances of the same word in a sentence simultaneously
                              mask_string='_',    # Default is UNKWORDZ, let's change it to a special token present in the model
                              random_state=124)   # Ensures reproducibility of the explanation results


from collections import Counter

import pandas as pd
import torch
from transformers import AutoTokenizer

def pred_proba_for_lime(sentences, model=model, tokenizer=tokenizer, device=DEVICE):

    # Count how many times each sentence variation occurs.
    counter = Counter(sentences)

    # Print the variations, most frequent first.
    print(pd.DataFrame(counter.items(), columns=['sentence', 'freq']).sort_values(by='freq', ascending=False))
    
    
    # Convert the sentences into the model's input format
    inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    
    with torch.no_grad():  # Disable gradient computation
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=-1).cpu().numpy()  # Compute probabilities with softmax
    
    return probs

I set bow=False so that repeated words are treated as distinct tokens according to their position in the sentence, and I set mask_string='_' so that the masking is easy to inspect in the output.

For instance, here is a short sentence example:

input_doc = '''The whole new iPhone 15 Pro is awesome'''
explainer.explain_instance(input_doc, pred_proba_for_lime).show_in_notebook(text=True)

In this case, there are 599 duplicates among the masked sentences that LIME generates. Even more concerning, the most frequently duplicated sentence is one in which no tokens remain at all.
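To put these counts in perspective: an 8-token sentence has at most 2^8 - 1 = 255 distinct masked variants, so any run that draws thousands of samples has to repeat some of them. The following is a rough simulation sketch, not LIME's actual code. It assumes that the sampler first draws the number of tokens to mask uniformly from 1 to n and then picks that many positions at random, and that num_samples is left at 5000. The helper simulate_masks is hypothetical and exists only for this illustration.

```python
import random
from collections import Counter


def simulate_masks(tokens, num_samples=5000, mask="_", seed=124):
    """Simulate position-wise masking under the ASSUMED sampling scheme.

    Assumption (not taken from LIME's source): for every sample, the number
    of masked tokens k is uniform on 1..n, then k positions are chosen
    without replacement.
    """
    rng = random.Random(seed)
    n = len(tokens)
    counts = Counter()
    for _ in range(num_samples):
        k = rng.randint(1, n)  # inclusive on both ends
        masked = set(rng.sample(range(n), k))
        sentence = " ".join(mask if i in masked else tok
                            for i, tok in enumerate(tokens))
        counts[sentence] += 1
    return counts


tokens = "The whole new iPhone 15 Pro is awesome".split()
counts = simulate_masks(tokens)

print(len(counts), "distinct sentences out of", sum(counts.values()))
for sentence, freq in counts.most_common(3):
    print(freq, repr(sentence))
```

Under this assumption only one variant has all n tokens masked, so it is expected roughly num_samples / n times, which would make the fully masked sentence the most frequent one without any malfunction being involved.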

Additionally, here is an example with a longer sentence:

input_doc = '''The whole new iPhone 15 Pro is awesome, truly setting a new benchmark in the world of smartphones. 
With its cutting-edge technology and innovative design, it stands out as a masterpiece of modern engineering. 
From its sleek, robust exterior to the advanced internal components, every aspect of the iPhone 15 Pro is designed to impress. '''
explainer.explain_instance(input_doc, pred_proba_for_lime).show_in_notebook(text=True)

Duplication is less frequent with the longer input, but a significant number of sentences are still duplicated. Notably, the most frequently duplicated variants contain no tokens at all apart from newline (\n) and backtick (```) characters.

LIME is expected to mask n tokens randomly, but the outcomes don't seem random. Is this normal or a malfunction? If it's a malfunction, is it okay to remove duplicates manually for a unique sentence set? This might significantly cut down on LIME's execution time if it's unnecessary to rerun duplicate sentences.
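One way to avoid rerunning the model on duplicates without changing the sample set that LIME sees is to deduplicate inside the prediction function and then expand the results back to one row per input sentence. The sketch below is only an illustration: dedupe_predict and toy_predict are hypothetical names, and the toy classifier stands in for the real model so the example runs on its own.

```python
import numpy as np


def dedupe_predict(predict_fn):
    """Wrap a prediction function so each distinct sentence is scored once."""
    def wrapped(sentences):
        sentences = list(sentences)
        # dict.fromkeys keeps the first occurrence of each sentence, in order.
        unique = list(dict.fromkeys(sentences))
        row_of = {s: i for i, s in enumerate(unique)}
        unique_probs = np.asarray(predict_fn(unique))
        # Expand back so the output has one row per original input sentence.
        return unique_probs[[row_of[s] for s in sentences]]
    return wrapped


# Toy stand-in for the real model (hypothetical): records the batch size it
# receives and returns two-class "probabilities" based on the mask count.
calls = []

def toy_predict(sentences):
    calls.append(len(sentences))
    p = np.array([s.count("_") / 3 for s in sentences], dtype=float)
    return np.column_stack([p, 1.0 - p])


batch = ["_ _ _", "The _ _", "_ _ _", "_ _ _"]
probs = dedupe_predict(toy_predict)(batch)
print(probs.shape, calls)
```

With the setup above, the wrapped function would be passed as explainer.explain_instance(input_doc, dedupe_predict(pred_proba_for_lime)). Whether duplicates can also be dropped from the sample set itself is a separate question, since that would change how often each variant counts when the local surrogate model is fitted.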

Mar 28 '24 07:03 smbslt3
