MOSS-RLHF
Question about the LM loss computation in the RM
In reward_trainer.py, the probability distribution for the last token is dropped from lm_logits, but in the labels below, the first token is dropped instead. Could you explain how these two line up?
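This is the standard causal-LM shift (a sketch of the general pattern, not necessarily the exact code in reward_trainer.py): the logit at position i is the model's prediction for token i+1, so dropping the last logit and the first label makes the two sequences align element-wise. A toy illustration with a hypothetical token sequence:

```python
# Why dropping the last logit and the first label aligns predictions
# with targets in a causal LM loss: at position i the model predicts
# the token at position i+1.
tokens = ["<s>", "the", "cat", "sat"]

# logits[i] is the prediction made after reading tokens[: i + 1]
logit_positions = tokens[:-1]   # drop last: nothing follows the final token
label_positions = tokens[1:]    # drop first: no logit targets token 0

# Each prediction position now pairs with the token it should predict:
pairs = list(zip(logit_positions, label_positions))
print(pairs)  # [('<s>', 'the'), ('the', 'cat'), ('cat', 'sat')]
```

This is the same alignment as the Hugging Face-style `shift_logits = lm_logits[..., :-1, :]` / `shift_labels = labels[..., 1:]` idiom: both tensors end up with length n-1 and index i in one corresponds to index i in the other.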
Wow. It looks like the issue is in the Hugging Face-hosted datasets themselves. Lots of duplicates too.
Here are the stats using only huggingface `datasets.load_dataset()`:
```python
from datasets import load_dataset
from tabulate import tabulate

# dataset info from data.py:
# [name, args to load_dataset(), keys used on each item]
ds_info = [
    ("filipino",
     ['dengue_filipino'],
     ['text', 'absent', 'dengue', 'health', 'mosquito', 'sick']),
    ("kirnews",
     ["kinnews_kirnews", "kirnews_cleaned"],
     ['label', 'title', 'content']),
    ("kinnews",
     ["kinnews_kirnews", "kinnews_cleaned"],
     ['label', 'title', 'content']),
    ("swahili",
     ['swahili_news'],
     ['label', 'text']),
]

lines = []
for name, args, keys in ds_info:
    ds = load_dataset(*args)
    # convert to list-of-tuples:
    train = [tuple([item[key] for key in keys]) for item in ds['train']]
    test = [tuple([item[key] for key in keys]) for item in ds['test']]
    lines.append(name)
    n_overlap = len(set(train).intersection(test))
    lines.append(tabulate([
        ("train:", len(train)),
        ("train unique:", len(set(train))),
        ("test:", len(test)),
        ("test unique:", len(set(test))),
        ("train/test overlap:", n_overlap,
         "%.1f%%" % (100.0 * n_overlap / len(set(test)))),
    ]))
    lines.append("\n")
print("\n".join(lines))
```
```
filipino
------------------- ---- ------
train:               4015
train unique:        3947
test:                4015
test unique:         3947
train/test overlap:  3947 100.0%
------------------- ---- ------

kirnews
------------------- ---- -----
train:               3689
train unique:        1791
test:                 923
test unique:          698
train/test overlap:   631 90.4%
------------------- ---- -----

kinnews
------------------- ----- -----
train:              17014
train unique:        9199
test:                4254
test unique:         2702
train/test overlap:   643 23.8%
------------------- ----- -----

swahili
------------------- ----- ----
train:              22207
train unique:       22207
test:                7338
test unique:         7338
train/test overlap:    34 0.5%
------------------- ----- ----
```
I suggest the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I'm also working on some Tagalog pipelines and I noticed the same upload issue (basically the train and test splits are a 1:1 match).
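If re-downloading from the original source isn't an option, one stopgap is to filter out any train example that also appears in the test split before training. A sketch with toy data (the tuples here are hypothetical placeholders, not the real splits):

```python
# Toy illustration of removing train/test contamination: drop any
# train example whose full tuple also occurs in the test split.
train = [("a", 0), ("b", 1), ("a", 0), ("c", 1)]
test = [("a", 0), ("d", 0)]

test_set = set(test)  # set lookup makes the filter O(len(train))
clean_train = [ex for ex in train if ex not in test_set]
print(clean_train)  # [('b', 1), ('c', 1)]
```

Note this removes both duplicated copies of the contaminated example from train; it does not fix within-split duplicates, which the stats above show are also present in kirnews/kinnews.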
I wrote a parser and some personal notes (file docstring) here. The parser uses some spaCy primitives but feel free to use this as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py
Hi @YannDubs, wow, thanks for pointing this out!!! I was only aware of the dataset issue with DengueFilipino. Thanks @kts for verifying the huggingface dataset issue. People should be aware of that and use the original links for those datasets. I will redo the experiment on Filipino using the original link @ljvmiranda921 provided. I will also check whether the KirundiNews overlap issue exists in its original dataset as well. Thanks again!
Here are results using the original DengueFilipino dataset. I also checked the original Kirundi dataset; it still has the data contamination issue.
Hello, could you provide the Filipino dataset? I cannot download it from the original link. Thank you very much.