Yushi Bai
Since different models use different tokenizers, to unify the length measure, the avg length we report is the character count for Chinese datasets and the word count for English datasets.
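For concreteness, a minimal sketch of this counting convention (the helper name `avg_length` is hypothetical, not part of our release):

```python
def avg_length(samples, lang):
    """Average length as reported: character count for Chinese datasets,
    whitespace-separated word count for English datasets."""
    if lang == "zh":
        lengths = [len(s) for s in samples]           # number of characters
    else:
        lengths = [len(s.split()) for s in samples]   # number of words
    return sum(lengths) / len(lengths)
```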
Thanks for your attention. Our paper is currently under review, so we are not open to result submissions at the moment. Nevertheless, we encourage you to add these evaluation results to your...
I guess there is a difference in the evaluation setting. In our experiment, we measure the prediction accuracy **only on the tail entity (t)** in the test triplets, following previous...
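As a rough illustration of that setting (the function names here are hypothetical, not from our code), accuracy is computed only over tail-entity predictions:

```python
def tail_entity_accuracy(predict_tail, test_triplets):
    """Accuracy measured only on the tail entity t of each test triplet
    (h, r, t); the head entity is never used as a prediction target."""
    correct = sum(int(predict_tail(h, r) == t) for h, r, t in test_triplets)
    return correct / len(test_triplets)
```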
Correct me if I'm wrong, but I think MetaQA mainly focuses on testing how well a QA system can take a natural-language (NL) multi-hop query and ground the NL...
I will look into the GPU usage issue that arises when iterative training is activated. The hyperparameters for all 4 datasets are shown in the command lines in README.md, you will...
BTW, you can also find the best hyperparameters in Table 8 of our paper (appendix).
Good question! In SFT, the loss is usually computed by taking a token-level mean within each sample and then a sequence-level mean across samples, which is exactly how Equation (2) is computed. If you instead take a token-level mean across different samples, samples with more target tokens receive more weight (effectively being upsampled), which introduces an imbalance between samples. If you want the total loss computed as "sum of target-token losses / total number of target tokens" as you describe, simply replace the per-sample token-level mean in the code with the sum of that sample's target-token losses.
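For illustration, here is a minimal PyTorch sketch contrasting the two aggregation schemes (the names `token_losses` and `target_masks` are hypothetical placeholders for per-token cross-entropy losses and target-token masks, not variables from our code):

```python
import torch

def loss_sample_then_batch_mean(token_losses, target_masks):
    """Equation (2) style: token-level mean within each sample,
    then sequence-level mean across samples."""
    per_sample = [
        (l * m).sum() / m.sum()            # mean over that sample's target tokens
        for l, m in zip(token_losses, target_masks)
    ]
    return torch.stack(per_sample).mean()  # mean over samples

def loss_global_token_mean(token_losses, target_masks):
    """'Sum of target-token losses / total number of target tokens':
    samples with more target tokens are effectively upsampled."""
    total = sum((l * m).sum() for l, m in zip(token_losses, target_masks))
    count = sum(m.sum() for m in target_masks)
    return total / count
```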
When loss weighting is used with packing training, `self.pack_loss` is set to `True`; please refer to the code under `if self.pack_loss:`. We first multiply the loss on each target token within a sample by its weight and take the sum, and then the losses of different samples across multiple GPUs are averaged in `transformers.Trainer`.
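A minimal sketch of the behavior under `if self.pack_loss:` (the variable names below are illustrative, not the actual ones in the repository): the per-token losses are multiplied by precomputed weights and summed within the packed sample, and `transformers.Trainer` then averages these per-sample losses across GPUs.

```python
def packed_weighted_loss(token_losses, loss_weights):
    """Weighted sum over one packed sample: each target token's loss is
    multiplied by its precomputed weight and summed (weights are assumed
    to be zero on non-target tokens)."""
    return (token_losses * loss_weights).sum()
```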
Seems like this PR also aims to mitigate the gradient accumulation issue in `transformers`: https://github.com/huggingface/transformers/pull/34191
Hi, sorry, I can no longer find the environment I used at the time. Besides the basics like torch and numpy, installing the transformers library should be enough; any reasonably recent version should work. If you run into specific environment issues, feel free to ask!
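If it helps, here is a quick sanity check for the environment (it only verifies that the packages mentioned above import; no specific versions are required):

```python
# Verify the basic packages are installed; reasonably recent versions should work.
import numpy
import torch
import transformers

print("numpy:", numpy.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```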