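Trouble reproducing P-tuning few-shot SuperGLUE results on some datasets

Hello, while reproducing the few-shot SuperGLUE experiments (i.e., the FewGLUE_32dev data), I found that my results on the CB, WSC, and COPA datasets fall short of those reported in the paper. All models in my reproduction are based on the single pretrained model albert-xxlarge-v2, consistent with the paper's setup, and the experiment seed is left unchanged at seed=42.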

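Differences in experimental settings:

CB experiment

The original script uses 8 GPUs / pet_per_gpu_train_batch_size=2 / pet_gradient_accumulation_steps=1, whereas my experiment uses 1 GPU / pet_per_gpu_train_batch_size=8 / pet_gradient_accumulation_steps=2; all other parameters are identical.
My best result is acc 85.71 with a corresponding f1-macro of 78.76; the paper reports 92.9/92.3.
Among the project's issues I found your explanation of why the CB results fall short of the paper: https://github.com/THUDM/P-tuning/issues/12. If the cause is an error in the script parameters, when will the training script be updated?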
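As a quick sanity check on the two CB configurations above, the number of examples per optimizer step can be computed from the three settings. This is a minimal sketch: the helper name `effective_batch_size` is made up for illustration and is not part of the P-tuning code. Matching this number does not guarantee matching results, since data sharding across GPUs and the order of examples can still differ.

```python
def effective_batch_size(num_gpus: int, per_gpu_batch_size: int,
                         accumulation_steps: int) -> int:
    """Number of examples that contribute to one optimizer step."""
    return num_gpus * per_gpu_batch_size * accumulation_steps


# Original CB script: 8 GPUs, per-GPU batch size 2, accumulation steps 1.
original = effective_batch_size(8, 2, 1)

# Reproduction: 1 GPU, per-GPU batch size 8, accumulation steps 2.
reproduction = effective_batch_size(1, 8, 2)

print(original, reproduction)  # both are 16
```

Under this arithmetic the two CB setups use the same effective batch size, so the gap would have to come from something else, such as the hardware, the library versions, or the random seed.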

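WSC experiment

Parameters are identical to the original script (1 GPU / pet_per_gpu_train_batch_size=16 / pet_gradient_accumulation_steps=1).
My best result is acc 81.73; the paper reports 84.6.

COPA experiment

Parameters are identical to the original script (1 GPU / pet_per_gpu_train_batch_size=16 / pet_gradient_accumulation_steps=1).
My best result is acc 79.00; the paper reports 87.0.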

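Python library version differences

Since version differences might affect the reproduction, here are my installed versions of the libraries listed in requirements.txt (the version pinned by the project's requirements is in parentheses):

numpy 1.19.5 (1.19)
jsonpickle 2.0.0 (1.1)
scikit-learn 0.24.1 (0.23.1)
torch 1.7.1+cu110 (1.5.0)
torchvision 0.8.2+cu110 (0.6.0)
transformers 4.5.1 (3.0.2)
tqdm 4.49.0 (4.48.1)
tensorboardX 2.2 (2.1)

Because the CUDA version on my machine is constrained, the versions of the torch-related libraries differ from those the code specifies; the other libraries, such as tqdm and tensorboardX, should have no effect on the results. Could these version differences be the reason for the different results?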
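A comparison like the list above can be generated automatically. The following is a minimal sketch using only the standard library; the pinned versions are copied from the list above, while the helper names (`installed_version`, `matches_pin`, `report`) are hypothetical and not part of the project.

```python
from importlib import metadata

# Versions pinned by the project's requirements, as quoted in the post.
PINNED = {
    "numpy": "1.19",
    "jsonpickle": "1.1",
    "scikit-learn": "0.23.1",
    "torch": "1.5.0",
    "torchvision": "0.6.0",
    "transformers": "3.0.2",
    "tqdm": "4.48.1",
    "tensorboardX": "2.1",
}


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """True if the installed version starts with the pinned components.

    A local build suffix such as "+cu110" is ignored.
    """
    if installed is None:
        return False
    base = installed.split("+", 1)[0].split(".")
    want = pinned.split(".")
    return base[:len(want)] == want


def report(pinned=PINNED):
    """Collect (name, pinned, installed, matches) for every pinned package."""
    rows = []
    for name, want in pinned.items():
        have = installed_version(name)
        rows.append((name, want, have, matches_pin(have, want)))
    return rows


for name, want, have, ok in report():
    status = "ok" if ok else "DIFFERS"
    print(f"{name}: pinned {want}, installed {have} [{status}]")
```

With the versions listed above, only numpy would be reported as matching its pin; every other library, including torch and transformers, differs from the pinned version.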

Hardware differences

All reproduction experiments were run on a single GeForce RTX 3090.

How should I interpret the gap in model performance?

May 01 '21 12:05 Riroaki

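I have the same question. In my SuperGLUE experiments, the scores on several datasets depend strongly on the random seed (and even more strongly on the codebase used: the scores from Jiant and AllenNLP also differ by several points). CB can even swing from the 70s to the 90s. How did the authors handle these sources of randomness?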

May 08 '21 08:05 slczgwh

On CB, ten random seeds with nothing more than BERT-BASE-UNCASED already produce differences of this size (results from Jiant).

f13_bb

0.912281 0.866667 0.867925 0.945455 0.867925 0.857143 0.915254 0.912281 0.836364 0.912281

May 08 '21 08:05 slczgwh
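To put a number on the seed sensitivity in the ten scores above, they can be summarized with the standard library. This is only a sketch over the values as posted; it assumes they are ten independent runs that differ only in the seed.

```python
import statistics

# The ten CB scores posted above (label f13_bb in the Jiant output).
scores = [
    0.912281, 0.866667, 0.867925, 0.945455, 0.867925,
    0.857143, 0.915254, 0.912281, 0.836364, 0.912281,
]

mean = statistics.mean(scores)
stdev = statistics.stdev(scores)      # sample standard deviation
spread = max(scores) - min(scores)    # best seed minus worst seed

print(f"mean={mean:.4f} stdev={stdev:.4f} "
      f"min={min(scores):.4f} max={max(scores):.4f} spread={spread:.4f}")
```

The gap between the best and the worst seed is about 11 points, so a single-seed result on CB says little on its own; reporting the mean and standard deviation over several seeds makes comparisons more meaningful.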

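When you ran it, did you set the emb size to 768, and did you change any other code? I am running RTE and my metric is very low, only in the 30s to 40s, and I don't know why.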

Sep 26 '21 09:09 ywb2018

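> When you ran it, did you set the emb size to 768, and did you change any other code? I am running RTE and my metric is very low, only in the 30s to 40s, and I don't know why.

It is indeed strange. I tried running the CB script and got an error: the prompt embedding size defaults to 128, so it does not match the BERT embeddings it is supposed to replace. Yet the CB script does not set this embedding parameter at all?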

Oct 09 '21 03:10 rookiebird

Thanks for your great work in reproducing P-tuning for few-shot SuperGLUE. In practice, we find that the reproducibility of few-shot learning depends heavily on the environment, the hyper-parameters (e.g., batch size, gradient accumulation steps), and the number of parallel GPUs. For example, in our experiments we use 8 V100 GPUs to train on a single dataset; if fewer GPUs or a different type of GPU is used, the performance can vary greatly.

In light of this volatility, in the follow-up work FewNLU, @zheng-yanan presents a more robust evaluation framework for few-shot SuperGLUE. P-tuning is also re-implemented in the FewNLU framework. Please check it out if you have trouble setting up the same environment for a fair comparison.

Dec 07 '21 11:12 Xiao9905

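> prompt embedding

Does the prompt embedding size need to be set to the same value as the pretrained model's embedding_dim? Running the author's code as-is raises a dimension mismatch error.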

May 25 '22 06:05 SCU-JJkinging
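The dimension mismatch discussed above can be caught before training with a small check. This is a sketch under stated assumptions, not the project's code: it assumes the prompt vectors are written into the model's input (word) embedding tensor, so the two widths must agree. The function name and the lookup table are hypothetical; the two dimensions are taken from the published model configurations (128 for albert-xxlarge-v2, 768 for bert-base-uncased).

```python
# Assumed input (word) embedding widths, taken from the published configs.
# Note that albert-xxlarge-v2 has a 128-dim word embedding even though its
# hidden size is larger.
INPUT_EMBEDDING_DIM = {
    "albert-xxlarge-v2": 128,
    "bert-base-uncased": 768,
}


def check_prompt_embedding_dim(model_name: str, prompt_embed_size: int) -> int:
    """Raise a clear error if the prompt embedding width does not match
    the model's input embedding width; return the expected width."""
    expected = INPUT_EMBEDDING_DIM[model_name]
    if prompt_embed_size != expected:
        raise ValueError(
            f"prompt embedding size {prompt_embed_size} does not match the "
            f"input embedding size {expected} of {model_name}"
        )
    return expected


# A default of 128 fits albert-xxlarge-v2 ...
print(check_prompt_embedding_dim("albert-xxlarge-v2", 128))

# ... but fails for bert-base-uncased, which needs 768.
try:
    check_prompt_embedding_dim("bert-base-uncased", 128)
except ValueError as err:
    print(err)
```

If this assumption holds, a default of 128 would work unchanged with albert-xxlarge-v2 and would need to be set to 768 when switching to a BERT-base model, which is consistent with the errors reported in this thread.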
