Langid model gives languages not in langid_lang_subset on difficult strings
Describe the bug
If you set a lang_subset on the langid processor, the result is not always a language from the subset.
To Reproduce
Steps to reproduce the behavior:
import stanza
langid = stanza.Pipeline("multilingual", langid_lang_subset = ["es"])
langid("aaa").lang
The result will be:
la
Expected behavior
The language should always be es, since I set lang_subset = ["es"].
Environment (please complete the following information):
- Stanza version: 1.4.0
Additional context
I see that here:
https://github.com/stanfordnlp/stanza/blob/011b6c4831a614439c599fd163a9b40b7c225566/stanza/models/langid/model.py#L85 the self.lang_subset variable is used, which in my case should be ["es"].
Everything seems to work properly without a subset, but when self.lang_subset is set it does not always return a language from the subset.
I tried adding some print statements:
def prediction_scores(self, x):
    prediction_probs = self(x)
    print("prediction_probs")
    print(prediction_probs)
    if self.lang_subset:
        print("lang_mask")
        print(self.lang_mask)
        prediction_batch_size = prediction_probs.size()[0]
        print("prediction_batch_size")
        print(prediction_batch_size)
        batch_mask = torch.stack([self.lang_mask for _ in range(prediction_batch_size)])
        print("batch_mask")
        print(batch_mask)
        prediction_probs = prediction_probs * batch_mask
        print("prediction_probs")
        print(prediction_probs)
    print("argmax")
    print(torch.argmax(prediction_probs, dim=1))
    print(self.idx_to_tag[torch.argmax(prediction_probs, dim=1)])
    return torch.argmax(prediction_probs, dim=1)
The result is:
prediction_probs
tensor([[-0.8162, -1.6429, 2.5455, -1.8278, -2.5117, -2.9207, -3.7888, -0.8804,
-0.3428, 1.0926, 0.6126, -4.4035, -0.5995, -3.2369, 2.7545, -0.2619,
-0.7622, -1.5564, -4.9973, -0.7835, -5.0576, -4.1394, 0.3663, -1.4397,
-3.2930, -1.5079, -0.7348, -1.1671, 1.4471, -6.8030, 1.9239, -2.3856,
-6.2493, 1.5562, -2.8086, -2.9353, -2.9437, 0.3282, 1.0697, -6.2935,
0.5006, -0.3015, -0.4489, -0.4419, 6.0343, -2.1565, 0.9285, -2.3867,
-2.4929, -0.8634, -3.6259, -4.6344, -0.1315, -4.4329, 0.6783, 0.3423,
-1.2810, -3.3283, -1.4476, -6.0616, -2.5742, -4.2238, -0.5372, -6.5371,
-2.5767, -1.5134, -3.3457, 1.6572]], device='cuda:0',
grad_fn=<SumBackward1>)
lang_mask
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
device='cuda:0')
prediction_batch_size
1
batch_mask
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],
device='cuda:0')
prediction_probs
tensor([[-0.0000, -0.0000, 0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
-0.0000, 0.0000, 0.0000, -0.0000, -0.0000, -0.0000, 0.0000, -0.0000,
-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, 0.0000, -0.0000,
-0.0000, -0.0000, -0.0000, -0.0000, 0.0000, -0.0000, 0.0000, -0.0000,
-0.0000, 0.0000, -0.0000, -0.0000, -0.0000, 0.0000, 0.0000, -0.0000,
0.0000, -0.0000, -0.0000, -0.0000, 0.0000, -0.0000, 0.0000, -0.0000,
-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, 0.0000, 0.0000,
-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.5372, -0.0000,
-0.0000, -0.0000, -0.0000, 0.0000]], device='cuda:0',
grad_fn=<MulBackward0>)
argmax
tensor([0], device='cuda:0')
la
I think in this case the string aaa was too difficult for the model (indeed it doesn't mean anything), so the probability of es was -0.5372 (< 0). All the other probabilities are set to -0.0000, which is a higher value, so when you compute the argmax you get the first label, which happens to be la.
I know this case happens rarely, since with a real word the highest probability is usually positive, but I found similar errors with models I trained with your training script: they sometimes pick the first language of the label list because all the languages in the lang_subset have negative probabilities.
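To illustrate what goes wrong, here is a tiny standalone example with made-up scores (not the actual model output) showing how multiplying a 0/1 mask into scores that can be negative lets a disallowed index win the argmax:

import torch

# Hypothetical raw scores for 4 languages; the only allowed language (index 2)
# happens to get a negative score on a hard input.
scores = torch.tensor([[0.3, 1.2, -0.5, -2.0]])
mask = torch.tensor([0., 0., 1., 0.])   # 1 only for the allowed language

masked = scores * mask                  # tensor([[0.0, 0.0, -0.5, -0.0]])
print(torch.argmax(masked, dim=1))      # tensor([0]) -- a disallowed index wins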
I can see exactly where the problem is:
def build_lang_mask(self, use_gpu=None):
    """
    Build language mask if a lang subset is specified (e.g. ["en", "fr"])
    """
    device = torch.device("cuda") if use_gpu else None
    lang_mask_list = [int(lang in self.lang_subset) for lang in self.idx_to_tag] if self.lang_subset else \
                     [1 for lang in self.idx_to_tag]
    self.lang_mask = torch.tensor(lang_mask_list, device=device, dtype=torch.float)
and then
def prediction_scores(self, x):
    prediction_probs = self(x)
    if self.lang_subset:
        prediction_batch_size = prediction_probs.size()[0]
        batch_mask = torch.stack([self.lang_mask for _ in range(prediction_batch_size)])
        prediction_probs = prediction_probs * batch_mask
    return torch.argmax(prediction_probs, dim=1)
Maybe negative infinity instead of 0 for illegal languages would work better.
Definitely not negative infinity as a multiplier, considering that turns really unlikely languages (large negative scores) into super likely languages...
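For reference, a common way to restrict the argmax to a subset without this sign problem is to fill the disallowed positions with -inf rather than multiplying by the mask. This is only a sketch of the idea, not necessarily the exact change that shipped in 1.4.1; restrict_argmax is a hypothetical helper:

import torch

def restrict_argmax(prediction_probs, lang_mask):
    # lang_mask: 1.0 for allowed languages, 0.0 otherwise (same meaning as self.lang_mask).
    # masked_fill broadcasts the 1-D mask over the batch dimension, so no torch.stack is needed.
    masked = prediction_probs.masked_fill(lang_mask == 0, float("-inf"))
    return torch.argmax(masked, dim=1)

scores = torch.tensor([[0.3, 1.2, -0.5, -2.0]])   # made-up scores
mask = torch.tensor([0., 0., 1., 0.])             # only index 2 allowed
print(restrict_argmax(scores, mask))              # tensor([2])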
Thanks for pointing this out!
This is now fixed in 1.4.1