
Create dataset loader for TREC-2017 LiveQA

Open jason-fries opened this issue 2 years ago • 9 comments

Adding a Dataset

  • Name: TREC-2017 LiveQA
  • Description: None provided
  • Task: QA
  • Paper: https://trec.nist.gov/pubs/trec26/papers/Overview-QA.pdf
  • Data: https://github.com/abachaa/LiveQA_MedicalTask_TREC2017
  • License: ?

jason-fries avatar Mar 22 '22 00:03 jason-fries

#self-assign

luou-wen avatar Apr 01 '22 10:04 luou-wen

Hi @luou-wen, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8. You can respond to this comment or ping us on Slack or Discord.

No worries if you are not finished but still intend to work on this!

jason-fries avatar Apr 07 '22 22:04 jason-fries

Hi @jason-fries sorry for the late response. I did not see this message until today. If possible, may I pick this back up? I was intending to finish the dataloader and make a pull request today.

luou-wen avatar Apr 10 '22 08:04 luou-wen

Hi @luou-wen yes of course! I just re-assigned you.

hakunanatasha avatar Apr 10 '22 16:04 hakunanatasha

@hakunanatasha Thank you very much! I will continue working on it and make a pull request asap.

luou-wen avatar Apr 10 '22 18:04 luou-wen

Hi @luou-wen, just a ping on the status of this dataset. Please let us know if you are still working on it and when you plan to submit a PR. Thanks!!

jason-fries avatar Apr 19 '22 22:04 jason-fries

Hi @jason-fries, apologies for the delay. I am still working on it, and I will submit a PR by this Sunday at the latest.

luou-wen avatar Apr 19 '22 22:04 luou-wen

#self-assign

shamikbose avatar Jun 07 '22 20:06 shamikbose

@hakunanatasha @jason-fries I have a couple of questions about this dataset:

  1. This dataset has multiple answers for the same question. The bigbio_qa schema has one answer per question. Should I create multiple uids for the same question, one per answer?
  2. For QA tasks, it seems like they are framed as (question, context, answer) where the answer is supposed to be in the context. This dataset doesn't seem to have a context for the annotations. Sample from one of the documents:
<SUBJECT></SUBJECT>
	<MESSAGE>Literature on Cardiac amyloidosis.  Please let me know where I can get literature on Cardiac amyloidosis.  My uncle died yesterday from this disorder.  Since this is such a rare disorder, and to honor his memory, I would like to distribute literature at his funeral service.  I am a retired NIH employee, so I am familiar with the campus in case you have literature at NIH that I can come and pick up.  Thank you </MESSAGE>
	<SUB-QUESTIONS>
		<SUB-QUESTION subqid="Q1-S1">
			<ANNOTATIONS>
				<FOCUS>cardiac amyloidosis</FOCUS>
				<TYPE>information</TYPE>
			</ANNOTATIONS>
			<ANSWERS>
				<ANSWER answerid="Q1-S1-A1" pairid="1">Cardiac amyloidosis is a disorder caused by deposits of an abnormal protein (amyloid) in the heart tissue. These deposits make it hard for the heart to work properly.</ANSWER>
				<ANSWER answerid="Q1-S1-A2" pairid="2">The term "amyloidosis" refers not to a single disease but to a collection of diseases in which a protein-based infiltrate deposits in tissues as beta-pleated sheets. The subtype of the disease is determined by which protein is depositing; although dozens of subtypes have been described, most are incredibly rare or of trivial importance. This analysis will focus on the main systemic forms of amyloidosis, both of which frequently involve the heart.</ANSWER>
			</ANSWERS>
		</SUB-QUESTION>
	</SUB-QUESTIONS>
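
To make the first question concrete, here is a rough sketch of the "one uid per answer" option. It is only an illustration: the wrapping `<QUESTION>` root element is hypothetical (the fragment above has no root), the answer texts are shortened, and the output field names are placeholders rather than the confirmed bigbio_qa features. The context is left empty because the annotations provide none.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above. The <QUESTION> wrapper is a hypothetical
# root element added only so the fragment parses as a single XML document.
SAMPLE = """<QUESTION>
<SUBJECT></SUBJECT>
<MESSAGE>Please let me know where I can get literature on Cardiac amyloidosis.</MESSAGE>
<SUB-QUESTIONS>
  <SUB-QUESTION subqid="Q1-S1">
    <ANNOTATIONS>
      <FOCUS>cardiac amyloidosis</FOCUS>
      <TYPE>information</TYPE>
    </ANNOTATIONS>
    <ANSWERS>
      <ANSWER answerid="Q1-S1-A1" pairid="1">Cardiac amyloidosis is a disorder caused by deposits of an abnormal protein.</ANSWER>
      <ANSWER answerid="Q1-S1-A2" pairid="2">The term amyloidosis refers to a collection of diseases.</ANSWER>
    </ANSWERS>
  </SUB-QUESTION>
</SUB-QUESTIONS>
</QUESTION>"""


def flatten(xml_text):
    """Return one record per (sub-question, answer) pair.

    Field names are placeholders, not the confirmed bigbio_qa schema.
    """
    root = ET.fromstring(xml_text)
    question = (root.findtext("MESSAGE") or "").strip()
    records = []
    for sub in root.iter("SUB-QUESTION"):
        focus = sub.findtext("ANNOTATIONS/FOCUS")
        qtype = sub.findtext("ANNOTATIONS/TYPE")
        for ans in sub.iter("ANSWER"):
            records.append(
                {
                    "id": ans.get("answerid"),  # unique per answer
                    "question_id": sub.get("subqid"),  # shared across answers
                    "question": question,
                    "focus": focus,
                    "type": qtype,
                    "context": "",  # the dataset provides no context passage
                    "answer": (ans.text or "").strip(),
                }
            )
    return records


records = flatten(SAMPLE)
for record in records:
    print(record["id"], record["question_id"], record["answer"][:40])
```

With this layout the `answerid` serves as the unique id, while the `subqid` ties the duplicated questions back together.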

shamikbose avatar Jun 08 '22 15:06 shamikbose