Xingchen Song(宋星辰)
@Chung-I I notice that you used cross entropy in ocd_loss rather than KL divergence (which is the official loss in the paper 'Optimal Completion Distillation for sequence learning'), is this PR a right...
Should ocd_loss look like this?

```python
optimal_probs = F.softmax(q_val / temp, dim=-1)
loss += (optimal_probs * (torch.log(optimal_probs) - F.log_softmax(out_probs[b, :len_sample, :], dim=-1))).sum(dim=-1).mean()
```
> Yes, as the paper indicated, the loss they used is KL divergence; however, when performing backprop in this scenario, the two losses are actually equivalent in terms of gradient...
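For anyone curious, here is a minimal sketch of why the two losses give the same gradient (made-up shapes and variable names, not the repo's actual code): KL(p‖q) = CE(p, q) − H(p), and the entropy term H(p) of the fixed target distribution does not depend on the model's logits, so it contributes nothing to the gradient.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical shapes: 4 decoding positions over a vocab of 10 tokens.
logits = torch.randn(4, 10, requires_grad=True)   # model outputs
q_val = torch.randn(4, 10)                        # OCD Q-values (constant target)
target = F.softmax(q_val, dim=-1)                 # "optimal" distribution, no gradient flows here

log_q = F.log_softmax(logits, dim=-1)

# Cross entropy:  -sum_i p_i * log q_i
ce = -(target * log_q).sum(dim=-1).mean()

# KL divergence:  sum_i p_i * (log p_i - log q_i) = CE(p, q) - H(p)
kl = (target * (target.log() - log_q)).sum(dim=-1).mean()

g_ce, = torch.autograd.grad(ce, logits, retain_graph=True)
g_kl, = torch.autograd.grad(kl, logits)

# H(p) is constant w.r.t. the logits, so both losses yield identical gradients.
print(torch.allclose(g_ce, g_kl, atol=1e-6))  # True
```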
We convert fractions with a two-stage method (see the sketch below):
- stage-1: tag and construct the fraction structure, e.g. "三分之二" (two thirds) ==> `fraction { denominator: "3" frac: "/" numerator: "2" }`
- stage-2: reorder...
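A minimal, self-contained sketch of the same idea in plain Python, using regexes as stand-ins for the actual Thrax grammars (the tag names mirror the structure above, but this is not the repo's real rule set):

```python
import re

# Illustrative digit map; the real grammar covers full Chinese numerals.
CN_DIGITS = {"一": "1", "二": "2", "三": "3", "四": "4", "五": "5",
             "六": "6", "七": "7", "八": "8", "九": "9", "十": "10"}

def tag_fraction(text: str) -> str:
    """Stage 1: recognize 'X分之Y' and emit a tagged fraction structure."""
    def repl(m):
        den = CN_DIGITS.get(m.group(1), m.group(1))
        num = CN_DIGITS.get(m.group(2), m.group(2))
        return f'fraction {{ denominator: "{den}" frac: "/" numerator: "{num}" }}'
    return re.sub(r"([一二三四五六七八九十])分之([一二三四五六七八九十])", repl, text)

def reorder_fraction(tagged: str) -> str:
    """Stage 2: reorder the tagged fields so the numerator precedes the denominator."""
    def repl(m):
        den, num = m.group(1), m.group(2)
        return f"{num}/{den}"
    return re.sub(r'fraction \{ denominator: "([^"]+)" frac: "/" numerator: "([^"]+)" \}',
                  repl, tagged)

print(reorder_fraction(tag_fraction("三分之二")))  # -> 2/3
```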
Hi, you can add "3721 三七二十一" to `chinese_text_normalization/thrax/src/cn/hotfix.list` and re-compile this project. This is a dirty workaround.
Met the same issue.
`is_masked = torch.ByteTensor(feature.pop("is_masked").copy().astype(np.uint8))` @menggehe
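For context, a minimal repro of why the explicit cast is needed (my own example, assuming `feature["is_masked"]` arrives as a boolean numpy array as in the original data pipeline): older PyTorch versions refuse to build a tensor directly from a `numpy.bool_` array, hence `astype(np.uint8)` before constructing the ByteTensor.

```python
import numpy as np
import torch

# Hypothetical feature dict; in the real pipeline this comes from the preprocessed records.
feature = {"is_masked": np.array([True, False, True, False])}

# Cast bool -> uint8 before handing the array to ByteTensor; copy() keeps the
# popped array independent of the original record.
is_masked = torch.ByteTensor(feature.pop("is_masked").copy().astype(np.uint8))
print(is_masked)  # tensor([1, 0, 1, 0], dtype=torch.uint8)
```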
> the q/k/v wrong!!

Can you point out where the wrong code is? I compared this IMPL with ZihangDai's IMPL and didn't find anything wrong.
@graykode Hi~ Great thanks for your PyTorch IMPL of XLNet. I wonder whether you have a plan to implement the fine-tuning part in PyTorch?
Hi, currently I'm working on other projects; I will keep tracking this when I have time.