transformer-xl
parameter cutoff in the function single_core_graph
@kimiyoung Can you explain the meaning and usage of the parameter cutoffs in the function single_core_graph? Can you provide some examples? Thanks
Hi, any answer by the authors is going to be more accurate, but since I have looked up Adaptive Softmax I can comment on this and possibly help.
cutoffs is used to partition the vocabulary into groups according to word frequency, with the vocabulary indexed from the most frequent word to the least frequent.
E.g.: [0, 20000, 40000, 200000, 267735] means that Group_1 contains the 20,000 most common words (indices 0 to 19,999), whereas Group_4 contains the 267735 - 200000 = 67,735 rarest words.
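To make the partition concrete, here is a minimal sketch of how the example cutoffs map a word index to its group. The helper name group_of is hypothetical (it is not a function from the transformer-xl code), and it assumes word indices are sorted by descending frequency.

```python
from bisect import bisect_right

# Example cutoffs from the comment above: boundaries between frequency groups.
cutoffs = [0, 20000, 40000, 200000, 267735]


def group_of(token_id, cutoffs):
    """Return the 1-based group number for a word index.

    Hypothetical helper for illustration only; assumes index 0 is the
    most frequent word and cutoffs[-1] is the vocabulary size.
    """
    if not 0 <= token_id < cutoffs[-1]:
        raise ValueError("token_id is outside the vocabulary")
    # bisect_right counts how many boundaries are <= token_id,
    # which is exactly the 1-based group number.
    return bisect_right(cutoffs, token_id)


# Size of each group: difference between consecutive boundaries.
group_sizes = [hi - lo for lo, hi in zip(cutoffs, cutoffs[1:])]

print(group_sizes)
print(group_of(19999, cutoffs), group_of(20000, cutoffs), group_of(267734, cutoffs))
```

The group sizes follow directly from the differences between neighbouring cutoffs, so the last group is the 67,735 rarest words mentioned above.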
Why these groups? To speed up the computation of softmax-like probabilities, at the cost of a minor drop in accuracy. As you know, softmax is computed on the logits as: softmax(x_i) = e^{x_i} / \sum_{j}{ e^{x_j} }, where the sum runs over every word j in the vocabulary (including i).
Instead of computing the softmax formula over the whole vocabulary at once, Adaptive softmax assigns a part of the probability distribution (e.g. 0.3) to a group. Each word of the group is then assigned its own probability, taking up a portion of the probability mass of its group (e.g. a share of that 0.3).
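The two-step idea described above can be sketched in plain Python. This is a simplified toy, not the implementation in transformer-xl: the function name adaptive_probs, the toy logits, and the layout (frequent words scored directly in a "head" alongside one logit per rare-word group) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Standard softmax; subtracting the max keeps exp() numerically stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_probs(head_logits, tail_logits, n_shortlist):
    """Toy two-level softmax (illustrative assumption, not the repo's code).

    head_logits: n_shortlist logits for frequent words, followed by one
                 logit per rare-word group.
    tail_logits: one list of word logits per rare-word group.
    Returns the probability of every word, in vocabulary order.
    """
    if len(head_logits) != n_shortlist + len(tail_logits):
        raise ValueError("head must hold shortlist words plus one logit per group")
    head = softmax(head_logits)
    probs = head[:n_shortlist]  # frequent words get their probability directly
    for group_mass, logits in zip(head[n_shortlist:], tail_logits):
        # Each rare word takes a share of the mass assigned to its group.
        probs.extend(group_mass * p for p in softmax(logits))
    return probs


# Toy vocabulary of 6 words: 2 frequent words and two groups of 2 rare words.
head_logits = [2.0, 1.0, 0.5, -0.5]
tail_logits = [[0.3, 0.1], [0.0, -1.0]]
probs = adaptive_probs(head_logits, tail_logits, n_shortlist=2)
print(probs, sum(probs))
```

The word probabilities still sum to 1, because each group's words split exactly the mass that the head assigned to that group.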
Thank you @AndreaLK3