MOSS-RLHF

Question about the lm loss computation in the reward model (rm)

Open · DZ9 opened this issue 1 year ago · 5 comments

In reward_trainer.py here, the probability distribution for the last token is removed from lm_logits, but in the labels below, the first token is removed instead. Could you explain how these two line up? [screenshot of the relevant code]

DZ9 avatar Feb 05 '24 07:02 DZ9
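
For context, here is the standard next-token shift used in causal LM training, as a minimal sketch (assuming PyTorch; this is not necessarily the exact code in reward_trainer.py): the logit at position t is the model's prediction for token t+1, so the last logit has no label to score against, and the first token has no logit predicting it.

import torch.nn.functional as F

def lm_loss(lm_logits, input_ids):
    # lm_logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    shift_logits = lm_logits[:, :-1, :].contiguous()  # drop the last position
    shift_labels = input_ids[:, 1:].contiguous()      # drop the first token
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )

After the shift, shift_logits[:, t] and shift_labels[:, t] both refer to token t+1 of the original sequence, which is why one tensor drops its last position and the other drops its first.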

Wow. It looks like the issue is in the original Hugging Face datasets. Lots of duplicates, too.

Here are the stats using only Hugging Face's datasets.load_dataset():

from datasets import load_dataset
from tabulate import tabulate

# dataset info from data.py:
# [name, args to load_dataset(), keys used on each item]
ds_info = [

    ("filipino",
     ['dengue_filipino'],
     ['text', 'absent', 'dengue', 'health', 'mosquito', 'sick']),

    ("kirnews",
     ["kinnews_kirnews","kirnews_cleaned"],
     ['label','title','content']),
    
    ("kinnews",
     ["kinnews_kirnews", "kinnews_cleaned"],
     ['label','title','content']),

    ("swahili",
     ['swahili_news'],
     ['label','text']),

]

lines = []
for name,args,keys in ds_info:

    ds = load_dataset(*args)

    # convert each split to a list of tuples (hashable, so we can use sets):
    train = [tuple(item[key] for key in keys) for item in ds['train']]
    test  = [tuple(item[key] for key in keys) for item in ds['test']]

    lines.append(name)

    # count exact rows that appear in both train and test:
    n_overlap = len(set(train).intersection(test))
    lines.append(tabulate([
        ("train:",        len(train)),
        ("train unique:", len(set(train))),
        ("test:",         len(test)),
        ("test unique:",  len(set(test))),
        
        ("train/test overlap:", n_overlap,
         "%.1f%%" % (100.0 * n_overlap / len(set(test)))),
    ]))
    lines.append("\n")
    
print("\n".join(lines))

filipino
-------------------  ----  ------
train:               4015
train unique:        3947
test:                4015
test unique:         3947
train/test overlap:  3947  100.0%
-------------------  ----  ------


kirnews
-------------------  ----  -----
train:               3689
train unique:        1791
test:                 923
test unique:          698
train/test overlap:   631  90.4%
-------------------  ----  -----


kinnews
-------------------  -----  -----
train:               17014
train unique:         9199
test:                 4254
test unique:          2702
train/test overlap:    643  23.8%
-------------------  -----  -----


swahili
-------------------  -----  ----
train:               22207
train unique:        22207
test:                 7338
test unique:          7338
train/test overlap:     34  0.5%
-------------------  -----  ----

kts avatar Jul 18 '23 16:07 kts
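
In the meantime, one way to work around the contaminated Hugging Face copies is to drop test rows that also appear in train. A minimal sketch, reusing the list-of-tuples representation from the script above, so equality means an exact row match (dedup_test is just an illustrative helper, not part of any repo):

def dedup_test(train, test):
    # remove test rows that appear verbatim in train,
    # and collapse duplicates within test itself
    seen = set(train)
    clean = []
    for item in test:
        if item not in seen:
            clean.append(item)
            seen.add(item)
    return clean

Note this only catches exact duplicates; near-duplicates (e.g. rows differing only in whitespace) would need fuzzier matching.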

I suggest the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I'm also working on some Tagalog pipelines, and I noticed the same upload issue (the train and test splits are basically a 1:1 match).

I wrote a parser and some personal notes (in the file docstring) here. The parser uses some spaCy primitives, but feel free to use it as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py

ljvmiranda921 avatar Jul 18 '23 23:07 ljvmiranda921

Hi @YannDubs, wow, thanks for pointing this out!!! I was only aware of the DengueFilipino dataset issue. Thanks @kts for verifying the Hugging Face dataset issue. People should be aware of it and use the original links for those datasets. I will redo the Filipino experiment using the original link @ljvmiranda921 provided. I will also check whether the KirundiNews overlap appears in the original dataset. Thanks again!

bazingagin avatar Jul 20 '23 20:07 bazingagin

[screenshot: results table for the original DengueFilipino dataset]

Here are the results using the original DengueFilipino dataset. I also checked the original Kirundi dataset; it still has the data contamination issue.

bazingagin avatar Aug 01 '23 02:08 bazingagin


Hello, may I ask if you can provide the Filipino dataset? I cannot download it from the original link. Thank you very much.

maoxuxu avatar Jul 18 '24 11:07 maoxuxu