Langid model gives languages not in langid_lang_subset on difficult strings
Describe the bug
If you set a lang_subset on the langid processor, the result is not always a language from the subset.
To Reproduce
Steps to reproduce the behavior:
import stanza
langid = stanza.Pipeline("multilingual", langid_lang_subset = ["es"])
langid("aaa").lang
The result will be:
la
Expected behavior
The language should always be es, since I set lang_subset = ["es"].
Environment (please complete the following information):
- Stanza version: 1.4.0
Additional context
I see that here:
https://github.com/stanfordnlp/stanza/blob/011b6c4831a614439c599fd163a9b40b7c225566/stanza/models/langid/model.py#L85 the self.lang_subset variable is used, which in my case should be ["es"].
Everything seems to work properly without a subset, but when self.lang_subset is set it does not always return a language from the subset.
I tried adding some print statements:
def prediction_scores(self, x):
    prediction_probs = self(x)
    print("prediction_probs")
    print(prediction_probs)
    if self.lang_subset:
        print("lang_mask")
        print(self.lang_mask)
        prediction_batch_size = prediction_probs.size()[0]
        print("prediction_batch_size")
        print(prediction_batch_size)
        batch_mask = torch.stack([self.lang_mask for _ in range(prediction_batch_size)])
        print("batch_mask")
        print(batch_mask)
        prediction_probs = prediction_probs * batch_mask
        print("prediction_probs")
        print(prediction_probs)
    print("argmax")
    print(torch.argmax(prediction_probs, dim=1))
    print(self.idx_to_tag[torch.argmax(prediction_probs, dim=1)])
    return torch.argmax(prediction_probs, dim=1)
The result is:
prediction_probs
tensor([[-0.8162, -1.6429, 2.5455, -1.8278, -2.5117, -2.9207, -3.7888, -0.8804,
-0.3428, 1.0926, 0.6126, -4.4035, -0.5995, -3.2369, 2.7545, -0.2619,
-0.7622, -1.5564, -4.9973, -0.7835, -5.0576, -4.1394, 0.3663, -1.4397,
-3.2930, -1.5079, -0.7348, -1.1671, 1.4471, -6.8030, 1.9239, -2.3856,
-6.2493, 1.5562, -2.8086, -2.9353, -2.9437, 0.3282, 1.0697, -6.2935,
0.5006, -0.3015, -0.4489, -0.4419, 6.0343, -2.1565, 0.9285, -2.3867,
-2.4929, -0.8634, -3.6259, -4.6344, -0.1315, -4.4329, 0.6783, 0.3423,
-1.2810, -3.3283, -1.4476, -6.0616, -2.5742, -4.2238, -0.5372, -6.5371,
-2.5767, -1.5134, -3.3457, 1.6572]], device='cuda:0',
grad_fn=<SumBackward1>)
lang_mask
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
device='cuda:0')
prediction_batch_size
1
batch_mask
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],
device='cuda:0')
prediction_probs
tensor([[-0.0000, -0.0000, 0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
-0.0000, 0.0000, 0.0000, -0.0000, -0.0000, -0.0000, 0.0000, -0.0000,
-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, 0.0000, -0.0000,
-0.0000, -0.0000, -0.0000, -0.0000, 0.0000, -0.0000, 0.0000, -0.0000,
-0.0000, 0.0000, -0.0000, -0.0000, -0.0000, 0.0000, 0.0000, -0.0000,
0.0000, -0.0000, -0.0000, -0.0000, 0.0000, -0.0000, 0.0000, -0.0000,
-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, 0.0000, 0.0000,
-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.5372, -0.0000,
-0.0000, -0.0000, -0.0000, 0.0000]], device='cuda:0',
grad_fn=<MulBackward0>)
argmax
tensor([0], device='cuda:0')
la
I think in this case the string aaa was too difficult for the model (indeed it doesn't mean anything), so the probability of es was -0.5372 (< 0). All the other probabilities are set to -0.0000, which is a higher value, so when you compute the argmax you get the first label, which happens to be la.
I know this case happens rarely, since with a real word the highest probability is usually positive, but I found similar errors with models I trained with your training script: they sometimes pick the first language of the label list because all the languages in the lang_subset have negative probabilities.
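To illustrate what goes wrong, here is a tiny standalone example with made-up scores (not the actual model output) showing how multiplying a 0/1 mask into scores that can be negative lets a disallowed index win the argmax:

import torch

# Hypothetical raw scores for 4 languages; the only allowed language (index 2)
# happens to get a negative score on a hard input.
scores = torch.tensor([[0.3, 1.2, -0.5, -2.0]])
mask = torch.tensor([0., 0., 1., 0.])   # 1 only for the allowed language

masked = scores * mask                  # tensor([[0.0, 0.0, -0.5, -0.0]])
print(torch.argmax(masked, dim=1))      # tensor([0]) -- a disallowed index wins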
I can see exactly where the problem is:
def build_lang_mask(self, use_gpu=None):
    """
    Build language mask if a lang subset is specified (e.g. ["en", "fr"])
    """
    device = torch.device("cuda") if use_gpu else None
    lang_mask_list = [int(lang in self.lang_subset) for lang in self.idx_to_tag] if self.lang_subset else \
                     [1 for lang in self.idx_to_tag]
    self.lang_mask = torch.tensor(lang_mask_list, device=device, dtype=torch.float)
and then
def prediction_scores(self, x):
    prediction_probs = self(x)
    if self.lang_subset:
        prediction_batch_size = prediction_probs.size()[0]
        batch_mask = torch.stack([self.lang_mask for _ in range(prediction_batch_size)])
        prediction_probs = prediction_probs * batch_mask
    return torch.argmax(prediction_probs, dim=1)
Maybe negative infinity instead of 0 for illegal languages would work better.
Definitely not negative infinity as a multiplier, considering that turns really unlikely languages (large negative scores) into super likely languages...
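For reference, a common way to restrict the argmax to a subset without this sign problem is to fill the disallowed positions with -inf rather than multiplying by the mask. This is only a sketch of the idea, not necessarily the exact change that shipped in 1.4.1; restrict_argmax is a hypothetical helper:

import torch

def restrict_argmax(prediction_probs, lang_mask):
    # lang_mask: 1.0 for allowed languages, 0.0 otherwise (same meaning as self.lang_mask).
    # masked_fill broadcasts the 1-D mask over the batch dimension, so no torch.stack is needed.
    masked = prediction_probs.masked_fill(lang_mask == 0, float("-inf"))
    return torch.argmax(masked, dim=1)

scores = torch.tensor([[0.3, 1.2, -0.5, -2.0]])   # made-up scores
mask = torch.tensor([0., 0., 1., 0.])             # only index 2 allowed
print(restrict_argmax(scores, mask))              # tensor([2])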
Thanks for pointing this out!
This is now fixed in 1.4.1