
The "anwser" for some examples in "qasper.jsonl" is strange

Zcchill opened this issue 1 year ago · 6 comments

I downloaded the data from the official URL and found that the "answers" of several examples in "qasper.jsonl" are confusing. Here are a few examples:

```json
{"pred": "No", "answers": ["Yes", "No"], "all_classes": null, "length": 2317, "input": "Does this method help in sentiment classification task improvement?", "_id": "bcfe56efad9715cc714ffd2e523eaa9ad796a453e7da77a6"}
{"pred": "unanswerable", "answers": ["Yes", "Unanswerable"], "all_classes": null, "length": 2284, "actual_length": 3533, "input": "Is jiant compatible with models in any programming language?", "_id": "e5d1d589ddb30f43547012f04b06ac2924a1f4fdcf56daab"}
{"pred": "BERTBase", "answers": ["BERTbase", "BERTbase"], "all_classes": null, "length": 3852, "actual_length": 5701, "input": "What BERT model do they test?", "_id": "2a51c07e65a9214ed2cd3c04303afa205e005f4e1ccb172a"}
```
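For context on why contradictory references matter: QA scoring in benchmarks like this typically takes the maximum token-level F1 over all gold answers. Below is a minimal sketch of that convention (assuming the common SQuAD-style recipe; the helper names are illustrative, not LongBench's actual code). With `answers = ["Yes", "No"]`, any yes/no prediction scores 1.0, so the example cannot distinguish models:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a prediction and one reference (illustrative)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def qa_score(prediction: str, answers: list[str]) -> float:
    # Max over references: with answers=["Yes", "No"], both "yes"
    # and "no" predictions score perfectly.
    return max(token_f1(prediction, a) for a in answers)

print(qa_score("No", ["Yes", "No"]))   # 1.0
print(qa_score("Yes", ["Yes", "No"]))  # 1.0
```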

Zcchill · Jul 09 '24 11:07

Another example: "_id": "d1aa1132439bd292965634095bf1c9943e062bb6645ff78c". The query is "how many tags do they look at?" The given answer seems to be sourced from: "We employ two sources of e-book annotation data: (i) editor tags, and (ii) Amazon search terms. For editor tags, we collect data of 48,705 e-books from 13 publishers, namely Kunstmann, Delius-Klasnig, VUR, HJR, Diogenes, Campus, Kiwi, Beltz, Chbeck, Rowohlt, Droemer, Fischer and Neopubli." But I think "30 tags" would be a better answer, based on: "As shown in Table TABREF3, we collect Amazon review keywords for 2,896 e-books (publishers: Kiwi, Rowohlt, Fischer, and Droemer), which leads to 33,663 distinct review keywords and on average 30 keyword assignments per e-book."

Zcchill · Jul 10 '24 06:07

Thanks for your keen observation. We sampled the data directly from the Qasper test set; we suggest asking the authors of Qasper about these cases.

bys0318 · Jul 10 '24 07:07

Besides, I would like to replicate the "GPT-3.5-Turbo-16k" results from the paper, but the numbers I get are not very close to the reported ones. Since there is no official code for the API-based evaluation, I wonder what the possible reasons are. My results are as follows:

```json
{
  "2wikimqa":        {"0-4k": 57.09, "4-8k": 42.82, "8k+": 32.71},
  "hotpotqa":        {"0-4k": 68.44, "4-8k": 57.25, "8k+": 55.38},
  "multi_news":      {"0-4k": 28.57, "4-8k": 23.34, "8k+": 22.31},
  "qasper":          {"0-4k": 47.3,  "4-8k": 43.97, "8k+": 28.35},
  "multifieldqa_en": {"0-4k": 57.15, "4-8k": 51.67, "8k+": 57.52},
  "gov_report":      {"0-4k": 31.79, "4-8k": 28.82, "8k+": 27.34}
}
```

Experiment settings:

  1. I use the API provided by AzureOpenAI.
  2. The system prompt is empty: `[{"role": "system", "content": ""}, {"role": "user", "content": prompt}]`
  3. Inference hyper-parameters:

```python
completion = client.chat.completions.create(
    model="gpt-35-turbo-16k",
    messages=input,
    temperature=0.0,
    max_tokens=max_tokens,
    stop=stop_token,
)
response = completion.choices[0].message.content
```
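For anyone trying to reproduce this setup, here is a self-contained sketch of the call above (the endpoint, API version, and environment-variable names are placeholders; adjust them to your own Azure deployment):

```python
import os
from openai import AzureOpenAI  # requires openai>=1.0

# Placeholder endpoint/credentials; set these for your own deployment.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def query(prompt: str, max_tokens: int = 64, stop: list[str] | None = None) -> str:
    """Greedy decoding with an empty system prompt, mirroring the setup above."""
    completion = client.chat.completions.create(
        model="gpt-35-turbo-16k",  # Azure deployment name, not the OpenAI model id
        messages=[
            {"role": "system", "content": ""},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
        max_tokens=max_tokens,
        stop=stop,
    )
    return completion.choices[0].message.content
```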

Zcchill · Jul 10 '24 16:07

This might be due to model iteration. We tested GPT-3.5-Turbo-16k in August 2023; I think a different version is being served now.

bys0318 · Jul 11 '24 08:07

"You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {context}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:" The instruction for qasper tasks in dataset2prompt seems redundent, is this a mistake or a deliberate strategy to emphasize the task at both the beginning and the end of a long text (due to position bias)?

Zcchill · Jul 15 '24 05:07

You're right. We want to emphasize the task instruction, so we insert the instruction at both the start and the end of the input.
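One practical upside of the duplication: when an over-long prompt is truncated from the middle (a common convention in long-context evaluation; the sketch below uses placeholder names and is not LongBench's actual code), both copies of the instruction survive:

```python
def truncate_middle(token_ids: list[int], max_len: int) -> list[int]:
    """Drop tokens from the middle of an over-long prompt.

    Because the qasper template repeats the instruction at both the
    start and the end, middle truncation preserves both copies.
    `token_ids` and `max_len` stand in for the tokenized prompt and
    the model's context budget.
    """
    if len(token_ids) <= max_len:
        return token_ids
    half = max_len // 2
    return token_ids[:half] + token_ids[len(token_ids) - half:]
```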

bys0318 · Jul 15 '24 06:07