Slow `DiscretizedIntegratedGradientAttribution` method, also on GPU
🐛 Bug Report
Inference on a Google Colab GPU is very slow. There is no significant difference whether the model runs on CUDA or on the CPU.
🔬 How To Reproduce
The following `model.attribute(...)` code runs for around 33 to 47 seconds on both a Colab CPU and GPU. I tried passing the device to the model, and `model.device` confirms that it is running on CUDA, but it still takes very long to attribute only two sentences. (I don't know the underlying attribution computations well enough to say whether this is expected or should be faster. If it is always this slow, it seems practically infeasible to analyse larger corpora.)
```python
import inseq
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(inseq.list_feature_attribution_methods())

# Load the model with the DIG attribution method on the selected device
model = inseq.load_model("google/flan-t5-small", attribution_method="discretized_integrated_gradients", device=device)
model.to(device)

# Attributing these two sentences takes ~33-47 s on both CPU and GPU
out = model.attribute(
    input_texts=[
        "We were attacked by hackers. Was there a cyber attack?",
        "We were not attacked by hackers. Was there a cyber attack?",
    ],
)
print(model.device)  # confirms "cuda"
```
Environment
- OS: Linux (Google Colab)
- Python version: 3.8.10
- Inseq version: 0.3.3
Expected behavior
Faster inference when using a GPU (CUDA).
(Thanks, by the way, for the fix for returning the per-token scores in a dictionary; the new method works well :) )
Hi @MoritzLaurer, thanks for your comment!
The slowness you report is most likely specific to the `discretized_integrated_gradients` method, since the current implementation builds its non-linear interpolation paths sequentially. We currently have issue #113 tracking a batching bug with this method, and we are in touch with the authors.
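For intuition, here is a rough sketch of why that path construction resists batching (this is not Inseq's or the authors' actual implementation, and it assumes a simple nearest-neighbor anchor search): plain integrated gradients computes all interpolation points in one vectorized operation, whereas a DIG-style path snaps each point to a real token embedding and each step depends on the previous one, so the loop is inherently sequential.

```python
import torch

# Sketch only: contrast IG's batched linear path with a DIG-style
# discretized path. Inputs are (seq_len, emb_dim) embedding tensors.
def linear_path(baseline, inp, n_steps):
    # Plain IG: all interpolation points in a single vectorized op
    alphas = torch.linspace(0, 1, n_steps).view(-1, 1, 1)
    return baseline + alphas * (inp - baseline)  # (n_steps, seq, dim)

def discretized_path(baseline, inp, vocab_embeds, n_steps):
    # DIG-style: each anchor is snapped to a real token embedding,
    # and every step depends on the previous one, so no batching
    path, point = [inp], inp
    for step in range(n_steps - 1):
        target = point + (baseline - point) / (n_steps - 1 - step)
        # Nearest-neighbor search over the vocabulary (hypothetical)
        dists = torch.cdist(target, vocab_embeds)   # (seq, vocab)
        point = vocab_embeds[dists.argmin(dim=-1)]  # (seq, dim)
        path.append(point)
    return torch.stack(path)
```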
In the meantime, I suggest using the more common `saliency` or `integrated_gradients` approaches, which should be considerably faster on GPU. Bastings et al. (2022) show that Gradient L2 (the default output of `saliency` in Inseq since v0.3.3) works well in terms of faithfulness on Transformer-based classifiers, so that could be a good starting point! Alternatively, `attention` attribution only requires forward passes, but it's less principled.
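For example, sticking to your snippet above and only swapping the attribution method (same `load_model` and `attribute` calls, so this should run as-is on Inseq 0.3.3):

```python
import inseq
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Same model, but with the faster gradient-based saliency method
model = inseq.load_model("google/flan-t5-small", attribution_method="saliency", device=device)
out = model.attribute(
    input_texts=["We were attacked by hackers. Was there a cyber attack?"],
)
```

`integrated_gradients` can be swapped in the same way.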
Hope this helps!
OK, thanks, I will try the other methods. (Good to know that there might be a fix at some point; in my ad-hoc tests the `discretized_integrated_gradients` method seems to produce the most interpretable attributions.)