
Chinese Examples in MultiFieldQA-en

Open: wendywangwwt opened this issue 1 year ago • 1 comment

Hi! I'm working on a long document QA problem and looked into the MultiFieldQA-en dataset recently.

I downloaded the dataset using the following code snippet:

from datasets import load_dataset

dataset = load_dataset("THUDM/LongBench", "multifieldqa_en")

While examining the content, I noticed that 2 of the 150 entries are in Chinese rather than English (screenshot attached: "Screenshot 2024-05-05 at 4 27 36 PM").

Can you please take a look? Thank you!

wendywangwwt, May 05 '24 15:05

Hi! They are classified as English samples as they contain more English characters (a-zA-Z) than Chinese characters.

bys0318, May 09 '24 08:05
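
For anyone who wants to reproduce the check, here is a minimal sketch of the kind of character-count rule described in the reply above. It is not LongBench's actual classification code. The function names, the choice of the CJK Unified Ideographs range (U+4E00 to U+9FFF) as "Chinese characters", and the handling of ties are all assumptions made for illustration.

```python
import re

# Assumption: "English characters" means ASCII letters a-z and A-Z,
# as stated in the reply.
_LATIN = re.compile(r"[a-zA-Z]")
# Assumption: "Chinese characters" means the basic CJK Unified
# Ideographs block; the real rule may use a different range.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def count_chars(text):
    """Return (latin_count, cjk_count) for a string."""
    return len(_LATIN.findall(text)), len(_CJK.findall(text))


def guess_language(text):
    """Label text "en" if it has more ASCII letters than CJK characters.

    Ties and empty strings fall through to "zh"; that tie-breaking
    choice is an assumption, not something the thread specifies.
    """
    latin, cjk = count_chars(text)
    return "en" if latin > cjk else "zh"


def cjk_ratio(text):
    """Fraction of counted characters that are CJK (0.0 if none counted)."""
    latin, cjk = count_chars(text)
    total = latin + cjk
    return cjk / total if total else 0.0


if __name__ == "__main__":
    mixed = "深度学习 (deep learning) 模型"
    print(count_chars(mixed), guess_language(mixed), cjk_ratio(mixed))
```

Under a rule like this, a document that mixes Chinese text with enough English terms is still labeled "en", which would explain the two entries reported above. A stricter filter could threshold on `cjk_ratio` instead of a simple majority, though which cutoff is appropriate depends on the data.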