What is the difference between softmax and marginal mode in CRF?
The two modes of the CRF layer are join mode and marginal mode, but I am not clear about the marginal mode. I used this mode to train on my data, and the result is almost the same as softmax. Can anyone explain it?
@KARABAER Join mode is the real CRF by definition, and the Viterbi algorithm gives you the usual prediction result, which is not a softmax but the best path. Since a CRF is an undirected graph that models the joint distribution over all time slices, you cannot get an independent probability distribution for each time slice the way an RNN does. But you can of course talk about the marginal distribution of each time slice, either at training time or at prediction time (see the argument `test_mode`). The marginal mode computes the marginal distribution for each time slice at training time and uses the usual categorical cross-entropy as the loss. This is NOT the real CRF by definition, but an approximation that some implementations make because it fits the neural-network (directed-graph) architecture. It is closely related to the sum-product algorithm.
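To make the "best path vs. per-timestep distribution" distinction concrete, here is a minimal NumPy sketch of Viterbi decoding (what join mode's prediction does) over toy unary and transition log-scores. All function and variable names here are my own, not the library's:

```python
import numpy as np

def viterbi_decode(unary, trans):
    """Best label path (join-mode prediction) for a linear-chain CRF.

    unary: (T, C) per-timestep label log-scores
    trans: (C, C) transition log-scores, trans[a, b] = score of label a -> b
    """
    T, C = unary.shape
    score = unary[0].copy()              # best log-score of a path ending in each label
    back = np.zeros((T, C), dtype=int)   # backpointers
    for t in range(1, T):
        cand = score[:, None] + trans    # cand[a, b]: extend a path ending in a with b
        back[t] = np.argmax(cand, axis=0)
        score = unary[t] + np.max(cand, axis=0)
    path = [int(np.argmax(score))]       # best final label, then follow backpointers
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Marginal-mode prediction would instead take the argmax of each timestep's marginal independently; the two can disagree, because the transition scores couple neighboring labels along the whole path.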
@linxihui I don't understand how marginal mode works. How do we calculate the marginal probabilities after the CRF layer at each time step?
@linxihui But there is a mathematical formulation in a linear-chain CRF for finding the marginal probabilities at each time step. Please refer to this. What we need in order to calculate them are the forward path scores and the backward path scores (alpha and beta, respectively).
Say we want to find the marginal distribution at time step i; then we need alpha_{i-1} and beta_{i+1}, along with the unary score at time step i.
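As a rough illustration of this alpha/beta computation, here is a NumPy sketch of the forward-backward recursions for a toy linear-chain CRF in log-space. The names and score layout are my own assumptions, not the keras implementation's:

```python
import numpy as np

def logsumexp(x, axis):
    # numerically stable log(sum(exp(x))) along an axis
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def crf_marginals(unary, trans):
    """Per-timestep marginals p(y_t = c | x) for a linear-chain CRF.

    unary: (T, C) per-timestep label log-scores
    trans: (C, C) transition log-scores, trans[a, b] = score of label a -> b
    """
    T, C = unary.shape
    alpha = np.zeros((T, C))   # forward log-scores
    beta = np.zeros((T, C))    # backward log-scores
    alpha[0] = unary[0]
    for t in range(1, T):
        # alpha[t, b] = unary[t, b] + logsumexp_a(alpha[t-1, a] + trans[a, b])
        alpha[t] = unary[t] + logsumexp(alpha[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):
        # beta[t, a] = logsumexp_b(trans[a, b] + unary[t+1, b] + beta[t+1, b])
        beta[t] = logsumexp(trans + unary[t + 1][None, :] + beta[t + 1][None, :], axis=1)
    log_Z = logsumexp(alpha[-1], axis=0)   # partition function
    return np.exp(alpha + beta - log_Z)    # (T, C); each row sums to 1

rng = np.random.default_rng(0)
marg = crf_marginals(rng.normal(size=(5, 3)), rng.normal(size=(3, 3)))
# each row of marg is a probability distribution over the 3 labels at that timestep
```

This matches the formulation above: the unary score at time t enters alpha[t], while beta[t] carries the contribution of everything after t.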
What I wanted to ask @lzfelix was whether it's possible to do this in the current Keras implementation. To be more specific, I need to calculate the marginal probabilities in join mode. Is that possible with the current implementation?
To the best of my knowledge, if you want the probability distribution over the C possible labels at the i-th timestep, then you are going to need to compute the forward pass, i.e. alpha, up to timestep i; you don't need beta at all. Just remember that alpha gives the probability of arriving at some label at the i-th timestep from the beginning of the sequence, while beta[i] gives the probability of starting at the i-th timestep with some label and arriving at the end of the sequence, assuming that your sample is prepended and appended with special tokens marking the beginning and end of the sequence. You can find more details on these computations in Jacob Eisenstein's NLP book, which is public and can be found here.
Regarding the second part of your question, where you want to compute the probability of observing some desired label at the i-th timestep using join mode, which in turn uses the forward-backward computation, or, more specifically, the alpha computations, I do not think that this is possible right now. Maybe you could use the marginal mode as an approximation, as originally suggested by @linxihui?
Since the performance of join mode is much better than that of marginal mode, we prefer join mode, but the marginal probabilities are also important in applications. I also wonder how the CRF++ toolkit calculates these probabilities, and whether we can learn from it.