ml-qrecc

download this collection

Open bbei-z opened this issue 2 years ago • 8 comments

Hi, when I try to download the QReCC collection, it always returns a 503 error. Could you tell me the size of collection-paragraph, i.e. the collection after it has been split into smaller passages? If it is not too large, could you share it with us?

bbei-z avatar Jun 16 '23 10:06 bbei-z

Were you able to resolve the issue? When you follow the instructions, do you get 54M passages as mentioned? @RavitejaAnantha @tuzhucheng

wickcode avatar Oct 02 '24 11:10 wickcode

Sorry about the late reply. You can find a pre-built collection of passages here on AWS S3: aws s3 ls s3://mt-qrecc/collection-paragraph/.

tuzhucheng avatar Nov 18 '24 06:11 tuzhucheng

@tuzhucheng I get Access Denied when I run ls on your S3 bucket. Could you confirm?

BTW, the raw web pages can be downloaded from Zenodo (passages.zip).

hankcs avatar Dec 19 '24 23:12 hankcs

Hmm, I just tried to make it public again, please retry.

tuzhucheng avatar Jan 17 '25 19:01 tuzhucheng

Tried again, still Access Denied:

  • aws s3 ls s3://mt-qrecc/collection-paragraph/
  • curl https://mt-qrecc.s3.amazonaws.com/collection-paragraph/

hankcs avatar Jan 18 '25 01:01 hankcs

Hmm, what if you try the https url: https://mt-qrecc.s3.us-west-2.amazonaws.com/collection-paragraph/collection-paragraph.tar.gz.partaa? The file names are collection-paragraph.tar.gz.partaa to collection-paragraph.tar.gz.partaz (26 files).
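For anyone scripting the download, here is a minimal sketch of enumerating the 26 part URLs and joining the parts afterwards. It assumes the suffixes run from aa to az as listed above, and that the parts are plain byte-wise splits (as the `split` utility would produce) that can be concatenated in order; that second point is an assumption, not something confirmed in this thread. The function names are made up for illustration.

```python
import shutil
import string
import urllib.request
from pathlib import Path

BASE_URL = "https://mt-qrecc.s3.us-west-2.amazonaws.com/collection-paragraph/"
PART_PREFIX = "collection-paragraph.tar.gz.part"


def part_names():
    """Return the 26 file names, partaa through partaz."""
    return [f"{PART_PREFIX}a{letter}" for letter in string.ascii_lowercase]


def part_urls():
    """Return the HTTPS URL for each part, in order."""
    return [BASE_URL + name for name in part_names()]


def download_parts(dest_dir):
    """Download every part into dest_dir (hypothetical helper; needs network access)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, url in zip(part_names(), part_urls()):
        target = dest / name
        urllib.request.urlretrieve(url, str(target))
        paths.append(target)
    return paths


def concatenate_parts(part_paths, out_path):
    """Join the parts in order into one archive (assumes byte-wise splits)."""
    with open(out_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

If the assumption about the split holds, calling `concatenate_parts(download_parts("parts"), "collection-paragraph.tar.gz")` should leave a single archive that can be extracted with the usual tar tools.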

tuzhucheng avatar Feb 05 '25 08:02 tuzhucheng

Hello, for the turns where the "Truth_passages" field is empty, is that due to missing annotations, or is there another reason? For example, the first turn of conversation 1 in the test set:

{
    "Answer_URL": "https://explorehealthcareers.org/career/medicine/physician-assistant/",
    "Context": [],
    "Conversation_no": 1,
    "Conversation_source": "trec",
    "Question": "What is a physician's assistant?",
    "Transformer_rewrite": "What is a physician's assistant",
    "Truth_answer": "physician assistants are medical providers who are licensed to diagnose and treat illness and disease and to prescribe medication for patients",
    "Truth_passages": [],
    "Truth_rewrite": "What is a physician's assistant?",
    "Turn_no": 1
}
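To see how many turns are affected, something like the following could be used. It is only a sketch: it assumes the test set is a JSON array of records shaped like the one above, and `load_records` and `turns_without_truth_passages` are hypothetical helper names.

```python
import json


def load_records(path):
    """Load the dataset, assuming the file is a JSON array of turn records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def turns_without_truth_passages(records):
    """Return (Conversation_no, Turn_no) for every turn whose Truth_passages is empty or missing."""
    return [
        (record["Conversation_no"], record["Turn_no"])
        for record in records
        if not record.get("Truth_passages")
    ]
```

Comparing the length of the returned list with the total number of records would show whether the empty field is a rare omission or a common pattern.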

lujiarui-iie avatar Mar 26 '25 06:03 lujiarui-iie

There is another question I'd like to ask. For the first turn of conversation 1 in the test set mentioned above, the truth answer matches a sentence in the passage http://web.archive.org/web/20200106012242id_/https://explorehealthcareers.org/career/medicine/physician-assistant/_p0. However, the test set has no truth passage labeled for it.
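A check like this can be sketched in a few lines. The sketch assumes passage IDs have the form of a page URL followed by `_p` and a paragraph index, which is inferred only from the single ID above, and it treats a case-insensitive, whitespace-normalized substring match as "the answer appears in the passage". Both function names are hypothetical.

```python
import re


def split_passage_id(passage_id):
    """Split '<url>_p<index>' into (url, index); the ID format is an assumption."""
    url, sep, index = passage_id.rpartition("_p")
    if not sep or not index.isdigit():
        raise ValueError(f"unexpected passage id: {passage_id!r}")
    return url, int(index)


def _normalize(text):
    """Lowercase and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def answer_in_passage(answer, passage_text):
    """Return True if the normalized answer is a substring of the normalized passage."""
    return _normalize(answer) in _normalize(passage_text)
```

Running `answer_in_passage` over each turn with an empty "Truth_passages" list, against the passages that share its page URL, would list the candidates that may simply be missing a label.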

lujiarui-iie avatar Mar 26 '25 07:03 lujiarui-iie