Inference with a self-trained langid model
Hi, I trained a langid model on my dataset following these steps, ending with this command:
python -m stanza.models.lang_identifier --data-dir data --eval-length 10 --randomize --save-name model.pt --num-epochs 100
At the end, the .pt model is saved in the directory.
How can I test this new model by running inference on some inputs? The docs show how to do that with the standard model, but not with newly trained ones. Thank you!
Probably the best way would be:
import stanza
nlp = stanza.Pipeline("multilingual", langid_model_path="model_stanza.pt")
am I right?
If it's loading, then I think that must be right...
Let us know if that works, and we'll update the docs!
Yes, that is the proper way to load a custom model in Python code.
You can do a more comprehensive eval with this command:
python -m stanza.models.lang_identifier --data-dir data --load-model model.pt --mode eval --eval-length 50 --save-name model-results.jsonl
Full documentation here: https://stanfordnlp.github.io/stanza/langid.html
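For a quick test of the newly trained model, a minimal sketch (assuming the model was saved as model.pt in the current directory) is to load it into the multilingual pipeline and read the detected language from the returned document:

import stanza

# load the multilingual pipeline with the custom langid model
nlp = stanza.Pipeline("multilingual", langid_model_path="model.pt")

# run language identification on a single input and read the predicted language
doc = nlp("hello how are you?")
print(doc.lang)  # e.g. "en"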
It works! Is there also an easy way to extract the accuracy for each language?
In order to have something like {"en": 0.8, "de": 0.1, ...}
I've found a way to achieve this. I wanted to get at the _model output and obtain percentages that sum to 1 via a softmax layer.
To do that I reimplemented _process_list of the LangIDProcessor class and prediction_scores of the LangIDBiLSTM class inline. I'm not sure this is the cleanest way to do it. Do you agree @AngledLuffa?
Also, for the config I have to pass to LangIDProcessor, I'm not sure I've built the object correctly.
import torch
import stanza
from stanza.pipeline.langid_processor import LangIDProcessor
from stanza.models.common.doc import Document

# taken from https://discuss.pytorch.org/t/apply-mask-softmax/14212/13
def masked_softmax(vec, mask, dim=1):
    masked_vec = vec * mask.float()
    max_vec = torch.max(masked_vec, dim=dim, keepdim=True)[0]
    exps = torch.exp(masked_vec - max_vec)
    masked_exps = exps * mask.float()
    masked_sums = masked_exps.sum(dim, keepdim=True)
    zeros = (masked_sums == 0)
    masked_sums += zeros.float()
    return masked_exps / masked_sums

def get_predictions_scores(text, pipeline, k):
    print(f"Text: {text}")
    print(f"Output of the pipeline directly: {pipeline(text).lang}")

    # rebuild the langid config from the pipeline config
    config = {}
    for key in pipeline.config:
        if key.startswith("langid_"):
            config[key.split("langid_")[1]] = pipeline.config[key]
        else:
            config[key] = pipeline.config[key]
    processor = LangIDProcessor(config=config, pipeline=pipeline, use_gpu=True)

    docs = [text]
    # inline reimplementation of _process_list of LangIDProcessor
    if isinstance(docs[0], str):
        docs = [Document([], text) for text in docs]
    docs_by_length = {}
    for doc in docs:
        text = processor.clean_text(doc.text) if processor._clean_text else doc.text
        doc_length = len(text)
        if doc_length not in docs_by_length:
            docs_by_length[doc_length] = []
        docs_by_length[doc_length].append((doc, text))

    for doc_length in docs_by_length:
        inputs = [doc[1] for doc in docs_by_length[doc_length]]
        # inline reimplementation of prediction_scores of LangIDBiLSTM to get the raw predictions
        x = processor._text_to_tensor(inputs)
        prediction_probs = processor._model(x)
        if processor._model.lang_subset:
            # restrict the scores to the selected language subset before normalizing
            prediction_batch_size = prediction_probs.size()[0]
            batch_mask = torch.stack([processor._model.lang_mask for _ in range(prediction_batch_size)])
            prediction_probs = prediction_probs * batch_mask
            prediction_probs = masked_softmax(vec=prediction_probs, mask=batch_mask)
        else:
            softmax = torch.nn.Softmax(dim=1)
            prediction_probs = softmax(prediction_probs)

    # keep the k highest-scoring languages
    topk = torch.topk(prediction_probs, k)
    pred_scores = {}
    for i, pred in enumerate(topk.indices[0]):
        print(f"Language: {processor._model.idx_to_tag[pred]}: {topk.values[0][i]}")
        pred_scores[processor._model.idx_to_tag[pred]] = topk.values[0][i].item()
    return pred_scores

model_name = "my_model"
pipeline = stanza.Pipeline("multilingual", langid_model_path=model_name + ".pt")
pred_scores = get_predictions_scores(text="hello how are you?", pipeline=pipeline, k=3)
pred_scores
Anyway, now I have this output:
{'en': 0.9999321699142456,
'nn': 2.8831263989559375e-05,
'nl': 1.2516763490566518e-05}
EDIT
I used a masked_softmax for the case where processor._model.lang_subset is set, taking it from here: https://discuss.pytorch.org/t/apply-mask-softmax/14212/13
Without it, the percentages seemed to be wrong. I don't know if this is the most correct way to do it. It also seems to solve this issue I found: https://github.com/stanfordnlp/stanza/issues/1076
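For illustration, a small check of why a plain softmax over masked logits gives misleading percentages (assuming the masked_softmax function from the snippet above is in scope):

import torch

# toy logits for 4 languages; only the first two are in the lang_subset
logits = torch.tensor([[3.0, 1.0, 2.5, 2.0]])
mask = torch.tensor([[1.0, 1.0, 0.0, 0.0]])

# plain softmax over masked (zeroed) logits: the zeroed entries still contribute exp(0) = 1,
# so the excluded languages keep nonzero probability and the kept ones are deflated
print(torch.softmax(logits * mask, dim=1))

# masked_softmax renormalizes over the allowed languages only,
# so the excluded entries come out as exactly 0
print(masked_softmax(logits, mask))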
Ah, I misunderstood your previous question. You had said you want the accuracy for all languages, but what you want is the predictions for all languages. I should be able to add that functionality to the processor. Even better, would you be up for turning your code in this message into a pull request against the dev branch, stanza/models/langid/model.py and/or stanza/pipeline/langid_processor.py ?
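As a rough idea of what such an addition could look like, here is a hypothetical processor method wrapping the logic above (name, signature, and placement are illustrative only and not part of the current Stanza API):

# hypothetical method on LangIDProcessor; not part of the released Stanza API
def prediction_scores(self, texts, k=3):
    """Return, for each input text, its top-k languages with softmax probabilities."""
    results = []
    for text in texts:
        cleaned = self.clean_text(text) if self._clean_text else text
        # one text at a time, so there is no need to group inputs by length
        logits = self._model(self._text_to_tensor([cleaned]))
        if self._model.lang_subset:
            mask = torch.stack([self._model.lang_mask])
            probs = masked_softmax(logits * mask, mask)
        else:
            probs = torch.softmax(logits, dim=1)
        topk = torch.topk(probs, k)
        results.append({self._model.idx_to_tag[idx.item()]: val.item()
                        for idx, val in zip(topk.indices[0], topk.values[0])})
    return results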
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Ping regarding this - are you interested in making this block of code into a PR?