Yushi Bai


Thanks for catching this! We will release a verified version of the dataset in the future that corrects the questions with erroneous answers.

Thanks for the reminder; the paper has been updated.

Hi, we haven't tested DeepSeek-V3 on LongBench v2 yet. The current overall score was taken directly from the DeepSeek-V3 paper, which did not report the individual sub-scores.

Hi, we follow the design of [GPQA](https://arxiv.org/abs/2311.12022) for the w/o CoT mode and the w/ CoT mode. In w/ CoT mode, we first ask the model to generate its chain-of-thought...
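For illustration, here is a minimal sketch of the two evaluation modes. The prompt wording and the `model_call` helper are hypothetical, not the actual LongBench v2 prompts:

```python
def answer_without_cot(model_call, question: str) -> str:
    # w/o CoT: ask the model directly for the final answer.
    return model_call(f"{question}\nAnswer with the letter of the correct choice.")

def answer_with_cot(model_call, question: str) -> str:
    # w/ CoT: first elicit the model's reasoning, then ask for the
    # final answer conditioned on that chain-of-thought.
    cot = model_call(f"{question}\nLet's think step by step.")
    return model_call(f"{question}\nReasoning: {cot}\nTherefore, the answer is:")
```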

Thanks for your suggestion. The synthetic tasks in LongBench are constructed randomly in this way: we place the evidence paragraphs at random positions in the context. For the other tasks, to keep the distribution consistent with real-world scenarios, we avoid altering the original context in such an artificial manner. This kind of bias in the answer distribution often exists in real scenarios as well; for example, the beginning and end of an article are generally more important.
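A minimal sketch of this kind of random construction (the function and its signature are illustrative assumptions, not the actual LongBench code):

```python
import random

def build_synthetic_context(evidence: str, distractors: list, seed=None) -> str:
    """Place the evidence paragraph at a random position among distractors."""
    rng = random.Random(seed)
    paragraphs = list(distractors)
    # Choose a random insertion index, including the very start and end.
    pos = rng.randint(0, len(paragraphs))
    paragraphs.insert(pos, evidence)
    return "\n\n".join(paragraphs)
```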

Thanks for your suggestion. We will consider updating LongBench.

Thanks for your keen observation. We sample the data directly from the test set of [Qasper](https://allenai.org/project/qasper/home), so we suggest asking the authors of Qasper about this.

This might be due to model iteration. We tested GPT-3.5-Turbo-16k in August 2023; I believe it is a different version now.

You're right. We want to emphasize the task instruction, so we insert the instruction at both the start and the end of the input.
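The duplicated-instruction design described above can be sketched as follows (the function name and prompt layout are illustrative assumptions):

```python
def build_prompt(instruction: str, context: str) -> str:
    # Repeat the task instruction before and after the long context
    # so the model sees it at both ends of the input.
    return f"{instruction}\n\n{context}\n\n{instruction}"
```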