
Different attributions for IntegratedGradients/LayerIntegratedGradients for BERT embeddings.


❓ Questions and Help

When running inference on a trained BertForSequenceClassification model (which has a BertModel as its base), I get slightly different results from

  1. IntegratedGradients, passing embeddings as inputs, and
  2. LayerIntegratedGradients, initialized on the model.bert.embeddings layer, passing input ids.

In the following, "ig" stands for IntegratedGradients and "lig" for LayerIntegratedGradients.
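For context, the snippets below assume a model, tokenizer, device, and reference generator roughly along these lines (a sketch; the checkpoint name and setup are placeholders, since we use our own trained model):

import torch
from captum.attr import IntegratedGradients, LayerIntegratedGradients, TokenReferenceBase
from transformers import BertForSequenceClassification, BertTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder checkpoint; we actually use an internally trained model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased").to(device)
model.eval()  # disable dropout so both attribution paths see the same network

# Baseline generator that produces all-[PAD] reference sequences.
token_reference = TokenReferenceBase(reference_token_idx=tokenizer.pad_token_id)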

Forward functions:

def ig_func(inputs):
    # takes input ids; used with LayerIntegratedGradients
    pred = model(inputs)
    return pred.logits

def ig_embed_func(inputs):
    # takes precomputed embeddings; used with IntegratedGradients
    pred = model(inputs_embeds=inputs)
    return pred.logits

Attribution classes:

lig = LayerIntegratedGradients(ig_func, model.bert.embeddings)
ig = IntegratedGradients(ig_embed_func)

Input representations:

input_ids = ...
ref_input_ids = token_reference.generate_reference(input_ids.size(-1), device=device).unsqueeze(0)
input_embeds = model.get_input_embeddings()(input_ids)
ref_input_embeds = model.get_input_embeddings()(ref_input_ids)
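As a sanity check (a sketch, not something shown above): the two forward functions should return identical logits for matching inputs, since passing inputs_embeds to BertModel only replaces the word-embedding lookup, while position and token-type embeddings are still added internally:

with torch.no_grad():
    # Identical logits expected, provided the model is in eval mode (dropout off).
    assert torch.allclose(ig_func(input_ids), ig_embed_func(input_embeds), atol=1e-5)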

Calculation:

lig_attributions, lig_delta = lig.attribute(
    inputs=input_ids,
    baselines=ref_input_ids,
    additional_forward_args=None,
    return_convergence_delta=True,
    n_steps=50,
    target=0
)
ig_attributions, ig_delta = ig.attribute(
    inputs=input_embeds,
    baselines=ref_input_embeds,
    additional_forward_args=None,
    return_convergence_delta=True,
    n_steps=50,
    target=0
)
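Both calls return a convergence delta; comparing them (a quick sketch, not in the original code) helps gauge whether approximation error from n_steps=50 alone could explain the gap:

# Deltas measure how far each run is from satisfying the completeness axiom.
print("lig convergence delta:", lig_delta.abs().max().item())
print("ig convergence delta:", ig_delta.abs().max().item())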

Summation:

ig_attributions_sum = ig_attributions.sum(dim=-1).squeeze(0)
lig_attributions_sum = lig_attributions.sum(dim=-1).squeeze(0)

Output:

>>> ig_attributions_sum[:10], lig_attributions_sum[:10]
(tensor([ 0.0755, -0.0366, -0.0526,  0.1775,  0.0118,  0.0393,  0.0243,  0.0654,
          0.0490,  0.0587], dtype=torch.float64, grad_fn=<SliceBackward0>),
 tensor([ 0.0757, -0.0333, -0.0524,  0.0858,  0.0139,  0.0132,  0.0586,  0.0767,
          0.0621,  0.0593], dtype=torch.float64))

The outputs are "similar" but not identical.
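A rough way to quantify that gap (a sketch, not part of the run above):

# Compare the two per-token attribution vectors directly.
diff = (ig_attributions_sum - lig_attributions_sum).abs()
print("max abs difference:", diff.max().item())
print("cosine similarity:",
      torch.nn.functional.cosine_similarity(
          ig_attributions_sum, lig_attributions_sum, dim=0).item())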

As far as I can tell, the differences occur for any input and for any of our other trained models.

Background: we came across this issue when comparing the IG attributions computed by the ferret tool (which uses IntegratedGradients) with those produced by the approach Captum proposes (using LayerIntegratedGradients), and found the differences shown above.

phiwi · Apr 12 '23, 09:04