
Question about the CRF model

HaoDreamlong opened this issue 3 years ago · 10 comments

In the CRF model of version v0.3.2, the encoder ends with a Tanh layer and a Scale layer. Why is it necessary to add these two layers?

HaoDreamlong avatar Jan 04 '21 01:01 HaoDreamlong

This constrains the output scores to lie in a range given by the scale factor - e.g. for Scale(5.0) this is a soft clipping function to the range (-5.0, 5.0). Scores are in log space and this should allow plenty of dynamic range whilst improving training stability but it's possible the Tanh layer could be removed.

davidcpage avatar Jan 04 '21 12:01 davidcpage
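
For illustration, a minimal sketch of what such a soft-clipping head could look like (hypothetical module definition, not necessarily bonito's exact implementation):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Multiply activations by a fixed scalar."""
    def __init__(self, scale):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        return self.scale * x

# Tanh squashes activations into (-1, 1); Scale(5.0) then stretches
# them into (-5.0, 5.0) -- a smooth ("soft") clip on the log-space scores.
head = nn.Sequential(nn.Tanh(), Scale(5.0))

x = torch.randn(8) * 10   # raw encoder outputs, possibly large
print(head(x))            # every value now lies strictly in (-5, 5)
```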

Thank you for your reply. I have one more question, about the GlobalNorm layer. I read the PyTorch version of the logZ calculation in seqdist.sparse, and I guess it is used for some sort of normalization. Does it have something to do with the Scale layer? And what exactly does the GlobalNorm layer do?

HaoDreamlong avatar Jan 05 '21 01:01 HaoDreamlong

And about the first question: if the outputs are log-space probabilities, shouldn't the probabilities be in the range (0, 1), and shouldn't log_p be less than 0?

HaoDreamlong avatar Jan 05 '21 01:01 HaoDreamlong

The outputs of the network represent scores in a linear-chain CRF. You can use them to compute the log probability of a particular (aligned) output sequence by adding the log scores for the transitions at each timestep and subtracting logZ, the log of the global sum over all (aligned) sequence scores. Scale() controls the dynamic range of the log scores, but since these are unnormalised scores rather than log probs, they are not constrained to be negative (and their exponentials need not lie in (0, 1)).

davidcpage avatar Jan 05 '21 12:01 davidcpage
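
As a toy illustration of the relationship described above (a dense linear-chain CRF with made-up dimensions, not bonito's sparse 5-mer parameterisation):

```python
import torch

T, S = 4, 3                 # timesteps, states (toy sizes)
# scores[t, i, j]: log score for moving from state i to state j at step t
scores = torch.randn(T, S, S)

# logZ: log of the sum over all state paths, via the forward algorithm
alpha = torch.zeros(S)      # assume a uniform (zero) initial score
for t in range(T):
    alpha = torch.logsumexp(alpha[:, None] + scores[t], dim=0)
logZ = torch.logsumexp(alpha, dim=0)

# log prob of one particular aligned path: its summed transition
# scores minus logZ
path = [0, 2, 1, 1, 0]      # states at t = 0..T
path_score = sum(scores[t, path[t], path[t + 1]] for t in range(T))
log_p = path_score - logZ   # always <= 0, even though raw scores are not
print(log_p.item())
```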

Oh, I get it. So since the outputs represent log scores, the loss of the model is the (negative) sum of the correct (aligned?) path scores minus logZ, and the backward pass makes this loss smaller, pushing the correct paths toward the highest scores. And the decoder should work by finding the path with the highest score. Is that a proper description?

HaoDreamlong avatar Jan 06 '21 01:01 HaoDreamlong

Yes, that is right @HaoDreamlong

iiSeymour avatar Jan 06 '21 10:01 iiSeymour

Thank you very much. I have a little trouble understanding the variables named stay_indices/scores and move_indices/scores. Since stay_indices represents a 5-position base-4 number, and move_indices is stay_indices plus the previous step's value, in some extreme situations like stay_indices=341 (1 1 1 1 1) and move_indices=342 (1 | 1 1 1 1 1), don't they represent the same situation?

HaoDreamlong avatar Jan 07 '21 01:01 HaoDreamlong

We distinguish between being in state 1 1 1 1 1 and emitting a blank symbol (stay_index/score), and being in state 1 1 1 1 1 and emitting a 1 symbol (move_index/score). This leads to the same pair of before and after states, but a different emitted sequence. The inclusion of a blank symbol makes this a kind of CTC model, except that here the conditional independence assumption is replaced with a CRF.

davidcpage avatar Jan 08 '21 17:01 davidcpage
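
A small sketch of the distinction, using base-4 5-mer states as in the question (the index arithmetic here is illustrative, not lifted from seqdist):

```python
n_base, k = 4, 5
bases = "ACGT"
n_states = n_base ** k            # 1024 possible 5-mer states

def kmer(state):
    """Decode a state index into its 5-mer (most significant digit first)."""
    digits = [(state // n_base ** i) % n_base for i in reversed(range(k))]
    return "".join(bases[d] for d in digits)

state = 341                       # base-4 digits 1 1 1 1 1 -> "CCCCC"

# Stay: remain in state 341 and emit a blank -- the sequence does not grow.
# Move: shift a new base 1 ("C") into state 341, landing in
#       (341 * 4 + 1) % 1024 = 341 again, but this time emitting "C".
new_base = 1
next_state = (state * n_base + new_base) % n_states

print(kmer(state), "--blank-->", kmer(state))       # same state, blank emitted
print(kmer(state), "--C------>", kmer(next_state))  # same state, "C" emitted
```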

The model's decode function simply calculates logZ twice (for different semirings S) and obtains the gradient by autograd. It is hard to understand how this works as a Viterbi decoder. Could you tell me why such a delicate algorithm can produce the right answer?

HaoDreamlong avatar Jan 12 '21 06:01 HaoDreamlong
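
For readers following along, a toy demonstration of the trick being asked about (a dense made-up score tensor, showing the identity rather than seqdist's implementation): replacing logsumexp with max in the logZ recursion gives the best path score, and autograd through that max routes a gradient of 1.0 only along the argmax transitions, so the gradient is a one-hot encoding of the Viterbi path.

```python
import torch

T, S = 4, 3
scores = torch.randn(T, S, S, requires_grad=True)

# Same forward recursion as logZ, but in the max semiring:
# max over paths instead of logsumexp over paths.
alpha = torch.zeros(S)
for t in range(T):
    alpha = torch.max(alpha[:, None] + scores[t], dim=0).values
best_score = alpha.max()

# Backprop through max deposits gradient only on the argmax elements,
# i.e. exactly the transitions used by the best-scoring path.
best_score.backward()
viterbi_transitions = scores.grad.nonzero()  # (t, from_state, to_state) triples
print(viterbi_transitions)
```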

@davidcpage The RNN model doesn't need chunk_lengths. Is that because the RNN can deal with the blank padding at the end of the input, or do I have to fill the input entirely with useful information?

HaoDreamlong avatar Jan 18 '21 09:01 HaoDreamlong