
Question on crf layer, why loop through batch before crf layer?

Open lkqnaruto opened this issue 2 years ago • 4 comments

I checked out the code, where you wrote:

# 1- the CRF package assumes the mask tensor cannot have interleaved
# zeros and ones. In other words, the mask should start with True
# values, transition to False at some moment and never transition
# back to True. That can only happen for simple padded sequences.
# 2- The first column of mask tensor should be all True, and we
# cannot guarantee that because we have to mask all non-first
# subtokens of the WordPiece tokenization.

Can you explain a little bit on that? I'm still confused about what you mean here. What do you mean by "interleaved zeros and ones"?

Thank you

lkqnaruto avatar Nov 27 '21 04:11 lkqnaruto

Hi @lkqnaruto ,

By interleaved zeros and ones, I meant a mask like [0, 1, 0, 1, 1, 0, 0, 0, 1, ...] instead of [1, 1, 1, 1, 0, 0, 0]. Because we are using WordPiece, which is a subword tokenization, all word-continuation tokens (the ones that start with ##) do not have an associated tag prediction for the NER task; otherwise, words that are tokenized into 2+ tokens would have multiple predictions.

For instance, suppose we have these tokens: tokens = ["[CLS]", "Al", "##bert", "Ein", "##stein", ...]. The mask would be mask = [0, 1, 0, 1, 0, ...], which is incompatible with the CRF package. So we have to index the sequence using the mask and pass only ["Al", "Ein", ...] to the CRF.

The mask is different for each sequence in the batch and the masks have different lengths (sum of 1's), so this masking is not trivial to do without an explicit for loop.
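
For reference, here is a minimal sketch (not the repository's exact code) of the kind of per-sequence loop being discussed, assuming a torchcrf-style CRF layer; the names `emissions`, `first_subtoken_mask` and `crf` are illustrative:

```python
import torch

# Minimal sketch: keep only the first-subtoken positions of each sequence,
# re-pad them to a common length, and build a simple left-aligned mask
# ([1,1,...,1,0,...,0]) that the CRF package can accept.
def gather_first_subtokens(emissions, first_subtoken_mask):
    # emissions: (batch, seq_len, num_tags)
    # first_subtoken_mask: (batch, seq_len) bool, True only at first subtokens
    batch_size, _, num_tags = emissions.shape
    lengths = first_subtoken_mask.sum(dim=1)          # kept tokens per sequence
    max_len = int(lengths.max())

    packed = emissions.new_zeros(batch_size, max_len, num_tags)
    crf_mask = torch.zeros(batch_size, max_len, dtype=torch.bool)

    # The kept positions differ per sequence, so each one is indexed separately.
    for i in range(batch_size):
        kept = emissions[i, first_subtoken_mask[i]]   # (lengths[i], num_tags)
        packed[i, : kept.size(0)] = kept
        crf_mask[i, : kept.size(0)] = True
    return packed, crf_mask

# Illustrative usage with a torchcrf-style layer:
# packed, crf_mask = gather_first_subtokens(emissions, first_subtoken_mask)
# loss = -crf(packed, packed_tags, mask=crf_mask, reduction='mean')
```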

fabiocapsouza avatar Nov 27 '21 19:11 fabiocapsouza

Thank you for the reply. A follow-up question: why do we have to loop through each sequence of the batch before the CRF? I think the CRF package can handle batch-wise computation.

lkqnaruto avatar Nov 29 '21 03:11 lkqnaruto

Hi @fabiocapsouza,

I'm experimenting with different ways of handling subwords for the CRF layer. Why have you chosen to just take the first subtoken? Wouldn't some sort of pooling of the subword representations work better?

I would greatly appreciate it if you could share your thoughts on the matter!

ViktorooReps avatar Dec 10 '21 13:12 ViktorooReps

Hi @ViktorooReps , I used the first subtoken because that is the way BERT does it for NER, so it is the simplest way to add a CRF on top of it. Yeah, maybe some sort of pooling could be better, even though the subword representations are already contextual. It would be a nice experiment.
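
In case it helps, here is an illustrative sketch of the two strategies (first subtoken vs. mean pooling over subwords); `hidden` and `word_ids` are hypothetical names for this example, not the repository's API:

```python
import torch

# Build one representation per word from subword representations.
def word_representations(hidden, word_ids, pooling="first"):
    # hidden: (seq_len, dim) subword representations of one sequence
    # word_ids: list mapping each subword position to its word index,
    #           with None for special tokens like [CLS] / [SEP]
    words = {}
    for pos, wid in enumerate(word_ids):
        if wid is None:
            continue
        words.setdefault(wid, []).append(pos)

    reps = []
    for wid in sorted(words):
        positions = words[wid]
        if pooling == "first":
            reps.append(hidden[positions[0]])           # first subtoken only
        else:
            reps.append(hidden[positions].mean(dim=0))  # mean-pool all subtokens
    return torch.stack(reps)                            # (num_words, dim)
```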

fabiocapsouza avatar Dec 11 '21 20:12 fabiocapsouza