Yushi Bai


Thanks for catching this! We will release a verified version of the dataset in the future that corrects the questions with erroneous answers.

Thanks for the reminder; the paper has been updated.

Hi, we haven't tested DeepSeek-V3 on LongBench v2 yet. The current overall score was taken directly from the DeepSeek-V3 paper, which did not report the individual sub-scores.

Hi, we follow the design of [GPQA](https://arxiv.org/abs/2311.12022) for the w/o CoT mode and the w/ CoT mode. In w/ CoT mode, we first ask the model to generate its chain-of-thought...
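For illustration, here is a minimal sketch of the two evaluation modes. The prompt wording and the `model_call` helper are hypothetical, not the actual LongBench v2 prompts:

```python
def answer_without_cot(model_call, question: str) -> str:
    # w/o CoT: ask the model directly for the final answer.
    return model_call(f"{question}\nAnswer with the letter of the correct choice.")

def answer_with_cot(model_call, question: str) -> str:
    # w/ CoT: first elicit the model's reasoning, then ask for the
    # final answer conditioned on that chain-of-thought.
    cot = model_call(f"{question}\nLet's think step by step.")
    return model_call(f"{question}\nReasoning: {cot}\nTherefore, the answer is:")
```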

Thanks for your suggestion. The synthetic tasks in LongBench are constructed randomly in this way: we place the evidence paragraphs at random positions in the context. For the other tasks, to keep the distribution consistent with real-world scenarios, we avoid altering the original context in such an artificial manner. This kind of bias in the answer distribution often exists in real scenarios as well; for example, the beginning and end of an article are generally more important.
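A minimal sketch of this kind of random construction (the function and its signature are illustrative assumptions, not the actual LongBench code):

```python
import random

def build_synthetic_context(evidence: str, distractors: list, seed=None) -> str:
    """Place the evidence paragraph at a random position among distractors."""
    rng = random.Random(seed)
    paragraphs = list(distractors)
    # Choose a random insertion index, including the very start and end.
    pos = rng.randint(0, len(paragraphs))
    paragraphs.insert(pos, evidence)
    return "\n\n".join(paragraphs)
```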

Thanks for your suggestion. We will consider updating LongBench.

Thanks for your keen observation. We sample the data directly from the test set of [Qasper](https://allenai.org/project/qasper/home), so we suggest asking the authors of Qasper about this.

This might be due to model iteration. We tested GPT-3.5-Turbo-16k in August 2023; I believe it is a different version now.

You're right. We want to emphasize the task instruction, so we insert the instruction at both the start and the end of the input.
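The duplicated-instruction design described above can be sketched as follows (the function name and prompt layout are illustrative assumptions):

```python
def build_prompt(instruction: str, context: str) -> str:
    # Repeat the task instruction before and after the long context
    # so the model sees it at both ends of the input.
    return f"{instruction}\n\n{context}\n\n{instruction}"
```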