portuguese-bert
bert-crf
Hello, in this part of your code in the bertcrf class (forward fn), the comment says it passes only the first subtoken of each word to the CRF, but I don't understand how this happens. The lengths of seq_logits and seq_labels do not seem to change: they are still the same length as the sequence with sub tokens, CLS and SEP.
for seq_logits, seq_labels, seq_mask in zip(logits, labels, mask):
    # Index logits and labels using prediction mask to pass only the
    # first subtoken of each word to CRF.
    seq_logits = seq_logits[seq_mask].unsqueeze(0)
    seq_labels = seq_labels[seq_mask].unsqueeze(0)
    loss -= self.crf(seq_logits, seq_labels, reduction='token_mean')
Hi @Phd-Student2018, I'm not sure I understood your question, but here is an example of this indexing.
Suppose we have the following words, tokens and labels:
words = ["My", "name", "is", "Fabio"]
tokens = ["[CLS]", "My", "name", "is", "Fa", "##bio", "[SEP]"]
label_tags = ["X", "O", "O", "O", "B-PERSON", "X", "X"] # X is ignored
labels = [-100, 0, 0, 0, 1, -100, -100] # label tags converted to int ids
seq_mask = [False, True, True, True, True, False, False] # False for special tokens and word continuations ("##")
# The CRF layer must receive only the logits and labels of the tokens ["My", "name", "is", "Fa"]
# B = batch size
# S = sequence length
# C = number of classes/tags
# logits.shape == (B, S, C)
# labels.shape == (B, S)
# After zip:
# seq_logits.shape == (S, C)
# seq_labels.shape == (S,)
# The indexing of seq_logits and seq_labels by seq_mask will produce:
# seq_logits.shape == (P, C)
# seq_labels.shape == (P,)
# The unsqueeze adds back the batch dimension: (1, P, C) and (1, P)
P is the number of words given by basic whitespace and punctuation tokenization, P = seq_mask.sum()
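The selection above can be checked without a model. This is only a minimal sketch: plain Python lists stand in for the tensors, the helper `select` is hypothetical (it mimics `tensor[seq_mask]`), and the logit values are made up for illustration.

```python
# Plain-Python stand-in for the tensor indexing seq_logits[seq_mask].
tokens = ["[CLS]", "My", "name", "is", "Fa", "##bio", "[SEP]"]
labels = [-100, 0, 0, 0, 1, -100, -100]
seq_mask = [False, True, True, True, True, False, False]

C = 2  # number of tags in this toy example (O and B-PERSON)

# Dummy per-token scores (assumed values), one row of C numbers per token: (S, C).
seq_logits = [[float(i), float(-i)] for i in range(len(tokens))]


def select(rows, mask):
    """Hypothetical helper: keep positions where mask is True, like tensor[mask]."""
    return [row for row, keep in zip(rows, mask) if keep]


kept_tokens = select(tokens, seq_mask)      # first subtoken of each word
kept_labels = select(labels, seq_mask)      # length P, no -100 entries left
kept_logits = select(seq_logits, seq_mask)  # shape (P, C)

P = sum(seq_mask)  # same value as seq_mask.sum() on a tensor

# Wrapping in a list plays the role of unsqueeze(0): (P, C) -> (1, P, C).
batched_logits = [kept_logits]
batched_labels = [kept_labels]

print(kept_tokens)
print(kept_labels)
print(P, len(batched_logits), len(batched_logits[0]), len(batched_logits[0][0]))
```

The full-length tensors are only shortened inside the loop, after the indexing, so inspecting logits or labels before that line still shows length S.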
Hope it helps
Yes, it is very helpful. Thank you very much.
Please, another question about testing: to compare the prediction list with the original label list (y-true), how can we get y-true?